Abstract. We study the problem of learning in the presence of a drifting
target concept. Specifically, we provide bounds on the error rate at a
given time, given a learner with access to a history of independent
samples labeled according to a target concept that can change on each round.
One of our main contributions is a refinement of the best previous results 
for polynomial-time algorithms for the space of linear separators under 
a uniform distribution. We also provide general results for an algorithm 
capable of adapting to a variable rate of drift of the target concept. Some 
of the results also describe an active learning variant of this setting, and 
provide bounds on the number of queries for the labels of points in the 
sequence sufficient to obtain the stated bounds on the error rates. 


1 Introduction 

Much of the work on statistical learning has focused on learning settings in 
which the concept to be learned is static over time. However, there are many 
application areas where this is not the case. For instance, in the problem of 
face recognition, the concept to be learned actually changes over time as each 
individual’s facial features evolve over time. In this work, we study the problem 
of learning with a drifting target concept. Specifically, we consider a statistical 
learning setting, in which data arrive i.i.d. in a stream, and for each data point, 
the learner is required to predict a label for the data point at that time. We 
are then interested in obtaining low error rates for these predictions. The target 
labels are generated from a function known to reside in a given concept space, 
and at each time $t$ the target function is allowed to change by at most some
distance $\Delta_t$: that is, the probability the new target function disagrees with the
previous target function on a random sample is at most $\Delta_t$.

This framework has previously been studied in a number of articles. The
classic works of [HL91, HL94, BH96, Lon99, BBDK00] and [BL97] together provide
a general analysis of a closely related setting. Though the objectives in these
works are specified slightly differently, the results established there are easily
translated into our present framework, and we summarize many of the relevant
results from this literature in Section 3.
While the results in these classic works are general, the best guarantees on
the error rates are only known for methods having no guarantees of computational
efficiency. In a more recent effort, the work of [CMEDV10] studies this
problem in the specific context of learning a homogeneous linear separator, when
all the $\Delta_t$ values are identical. They propose a polynomial-time algorithm (based
on the modified Perceptron algorithm of [DKM09]), and prove a bound on the
number of mistakes it makes as a function of the number of samples, when the
data distribution satisfies a certain condition called "$\lambda$-good" (which generalizes
a useful property of the uniform distribution on the origin-centered unit
sphere). However, their result is again worse than that obtainable by the known
computationally-inefficient methods.

Thus, the natural question is whether there exists a polynomial-time algorithm
achieving roughly the same guarantees on the error rates known for the
inefficient methods. In the present work, we resolve this question in the case
of learning homogeneous linear separators under the uniform distribution, by 
proposing a polynomial-time algorithm that indeed achieves roughly the same 
bounds on the error rates known for the inefficient methods in the literature. 
This represents the main technical contribution of this work. 

We also study the interesting problem of adaptivity of an algorithm to the
sequence of $\Delta_t$ values, in the setting where $\Delta_t$ may itself vary over time. Since
the values $\Delta_t$ might typically not be accessible in practice, it seems important
to have learning methods having no explicit dependence on the sequence $\Delta_t$.
We propose such a method below, and prove that it achieves roughly the same
bounds on the error rates known for methods in the literature which require
direct access to the $\Delta_t$ values. Also in the context of variable $\Delta_t$ sequences, we
discuss conditions on the sequence $\Delta_t$ necessary and sufficient for there to exist
a learning method guaranteeing a sublinear rate of growth of the number of
mistakes.

We additionally study an active learning extension to this framework, in 
which, at each time, after making its prediction, the algorithm may decide 
whether or not to request access to the label assigned to the data point at 
that time. In addition to guarantees on the error rates (for all times, including 
those for which the label was not observed), we are also interested in bounding 
the number of labels we expect the algorithm to request, as a function of the 
number of samples encountered thus far. 


2 Definitions and Notation 

Formally, in this setting, there is a fixed distribution $\mathcal{P}$ over the instance space $\mathcal{X}$,
and there is a sequence of independent $\mathcal{P}$-distributed unlabeled data $X_1, X_2, \ldots$.
There is also a concept space $\mathbb{C}$, and a sequence of target functions
$\mathbf{h}^* = \{h^*_1, h^*_2, \ldots\}$ in $\mathbb{C}$. Each $t$ has an associated target label $Y_t = h^*_t(X_t)$. In this
context, a (passive) learning algorithm is required, on each round $t$, to produce
a classifier $\hat{h}_t$ based on the observations $(X_1, Y_1), \ldots, (X_{t-1}, Y_{t-1})$, and
we denote by $\hat{Y}_t = \hat{h}_t(X_t)$ the corresponding prediction by the algorithm for
the label of $X_t$. For any classifier $h$, we define $\mathrm{er}_t(h) = \mathcal{P}(x : h(x) \neq h^*_t(x))$.
We also say the algorithm makes a "mistake" on instance $X_t$ if $\hat{Y}_t \neq Y_t$; thus,
$\mathrm{er}_t(\hat{h}_t) = \mathbb{P}(\hat{Y}_t \neq Y_t \mid (X_1, Y_1), \ldots, (X_{t-1}, Y_{t-1}))$.

For notational convenience, we will suppose the $h^*_t$ sequence is chosen
independently from the $X_t$ sequence (i.e., $h^*_t$ is chosen prior to the "draw" of
$X_1, X_2, \ldots \sim \mathcal{P}$), and is not random.

In each of our results, we will suppose $\mathbf{h}^*$ is chosen from some set $S$ of
sequences in $\mathbb{C}$. In particular, we are interested in describing the sequence $\mathbf{h}^*$ in
terms of the magnitudes of changes in $h^*_t$ from one time to the next. Specifically,
for any sequence $\mathbf{\Delta} = \{\Delta_t\}_{t=2}^{\infty}$ in $[0,1]$, we denote by $S_{\mathbf{\Delta}}$ the set of all sequences
$\mathbf{h}^*$ in $\mathbb{C}$ such that, $\forall t \in \mathbb{N}$, $\mathcal{P}(x : h^*_t(x) \neq h^*_{t+1}(x)) \leq \Delta_{t+1}$.

Throughout this article, we denote by $d$ the VC dimension of $\mathbb{C}$ [VC71], and
we suppose $\mathbb{C}$ is such that $1 \leq d < \infty$. Also, for any $x \in \mathbb{R}$, define
$\mathrm{Log}(x) = \ln(\max\{x, e\})$.
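
As a concrete reading of the truncated logarithm just defined, the following is a minimal Python sketch. The helper name `complexity_term` is an assumption introduced here for illustration; it simply evaluates the quantity $d\,\mathrm{Log}(m/d) + \mathrm{Log}(1/\delta)$ that recurs in the bounds of Section 4.

```python
import math

def Log(x: float) -> float:
    """Truncated logarithm: Log(x) = ln(max{x, e}), so Log(x) >= 1 for every real x."""
    return math.log(max(x, math.e))

def complexity_term(m: int, d: int, delta: float) -> float:
    """Illustrative helper (not named in the paper): d*Log(m/d) + Log(1/delta)."""
    return d * Log(m / d) + Log(1.0 / delta)
```

The truncation matters for small arguments: with $m < ed$ the term $\mathrm{Log}(m/d)$ equals 1 rather than becoming small or negative.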

3 Background: ($\epsilon$, $S$)-Tracking Algorithms

As mentioned, the classic literature on learning with a drifting target concept is
expressed in terms of a slightly different model. In order to relate those results to
our present setting, we first introduce the classic setting. Specifically, we consider
a model introduced by [HL94], presented here in a more-general form inspired
by [BBDK00]. For a set $S$ of sequences $\{h_t\}_{t=1}^{\infty}$ in $\mathbb{C}$, and a value $\epsilon > 0$, an
algorithm $\mathcal{A}$ is said to be $(\epsilon, S)$-tracking if $\exists t_\epsilon \in \mathbb{N}$ such that, for any choice of
$\mathbf{h}^* \in S$, $\forall T \geq t_\epsilon$, the prediction $\hat{Y}_T$ produced by $\mathcal{A}$ at time $T$ satisfies

$$\mathbb{P}(\hat{Y}_T \neq Y_T) \leq \epsilon.$$

Note that the value of the probability in the above expression may be influenced
by $\{X_t\}_{t=1}^{T}$, $\{h^*_t\}_{t=1}^{T}$, and any internal randomness of the algorithm $\mathcal{A}$.

The focus of the results expressed in this classical model is determining sufficient
conditions on the set $S$ for there to exist an $(\epsilon, S)$-tracking algorithm,
along with bounds on the sufficient size of $t_\epsilon$. These conditions on $S$ typically
take the form of an assumption on the drift rate, expressed in terms of $\epsilon$. Below,
we summarize several of the strongest known results for this setting.

3.1 Bounded Drift Rate 

The simplest, and perhaps most elegant, results for $(\epsilon, S)$-tracking algorithms
are for the set $S$ of sequences with a bounded drift rate. Specifically, for any
$\Delta \in [0,1]$, define $S_\Delta = S_{\mathbf{\Delta}}$, where $\mathbf{\Delta}$ is such that $\Delta_{t+1} = \Delta$ for every $t \in \mathbb{N}$.
The study of this problem was initiated in the original work of [HL94]. The
best known general results are due to [Lon99]: namely, that for some
$\Delta_\epsilon = \Theta(\epsilon^2/d)$, for every $\epsilon \in (0,1]$, there exists an $(\epsilon, S_\Delta)$-tracking algorithm for all
values of $\Delta \leq \Delta_\epsilon$. This refined an earlier result of [HL94] by a logarithmic
factor. [Lon99] further argued that this result can be achieved with $t_\epsilon = \Theta(d/\epsilon)$.
The algorithm itself involves a beautiful modification of the one-inclusion graph
prediction strategy of [HLW94]; since its specification is somewhat involved, we
refer the interested reader to the original work of [Lon99] for the details.

Footnote: In fact, [Lon99] also allowed the distribution $\mathcal{P}$ to vary gradually over time. For
simplicity, we will only discuss the case of fixed $\mathcal{P}$.


3.2 Varying Drift Rate: Nonadaptive Algorithm 

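In addition to the concrete bounds for the case $\mathbf{h}^* \in S_\Delta$, [HL94] additionally
present an elegant general result. Specifically, they argue that, for any $\epsilon > 0$,
and any $m = \Omega\left(\frac{d}{\epsilon}\mathrm{Log}\frac{1}{\epsilon}\right)$, if $\sum_{i=1}^{m} \mathcal{P}(x : h^*_i(x) \neq h^*_{m+1}(x)) \leq m\epsilon/24$, then for
$\hat{h} = \mathrm{argmin}_{h \in \mathbb{C}} \sum_{i=1}^{m} \mathbb{1}[h(X_i) \neq Y_i]$, $\mathbb{P}(\hat{h}(X_{m+1}) \neq h^*_{m+1}(X_{m+1})) \leq \epsilon$. This result
immediately inspires an algorithm $\mathcal{A}$ which, at every time $t$, chooses a value
$m_t \leq t-1$, and predicts $\hat{Y}_t = \hat{h}_t(X_t)$, for $\hat{h}_t = \mathrm{argmin}_{h \in \mathbb{C}} \sum_{i=t-m_t}^{t-1} \mathbb{1}[h(X_i) \neq Y_i]$.
We are then interested in choosing $m_t$ to minimize the value of $\epsilon$ obtainable via
the result of [HL94]. However, that method is based on the values $\mathcal{P}(x : h^*_i(x) \neq h^*_t(x))$,
which would typically not be accessible to the algorithm. Suppose instead that
we have access to a sequence $\mathbf{\Delta}$ such that $\mathbf{h}^* \in S_{\mathbf{\Delta}}$. In this case,
we could approximate $\mathcal{P}(x : h^*_i(x) \neq h^*_t(x))$ by its upper bound $\sum_{j=i+1}^{t} \Delta_j$.
We are then interested in choosing $m_t$ to minimize the smallest value of $\epsilon$ such
that $\sum_{i=t-m_t}^{t-1} \sum_{j=i+1}^{t} \Delta_j \leq m_t\epsilon/24$ and $m_t = \Omega\left(\frac{d}{\epsilon}\mathrm{Log}\frac{1}{\epsilon}\right)$. One can easily verify
that this minimum is obtained at a value

$$m_t = \Theta\left( \mathop{\mathrm{argmin}}_{m \leq t-1} \frac{1}{m} \sum_{i=t-m}^{t-1} \sum_{j=i+1}^{t} \Delta_j + \frac{d\,\mathrm{Log}(m/d)}{m} \right),$$

and via the result of [HL94] (applied to the sequence $X_{t-m_t}, \ldots, X_t$) the resulting
algorithm has

$$\mathbb{P}(\hat{Y}_t \neq Y_t) \leq O\left( \min_{1 \leq m \leq t-1} \frac{1}{m} \sum_{i=t-m}^{t-1} \sum_{j=i+1}^{t} \Delta_j + \frac{d\,\mathrm{Log}(m/d)}{m} \right). \quad (1)$$

As a special case, if every $t$ has $\Delta_t = \Delta$ for a fixed value $\Delta \in [0,1]$, this
result recovers the bound $\sqrt{d\Delta\,\mathrm{Log}(1/\Delta)}$, which is only slightly larger than that
obtainable from the best bound of [Lon99]. It also applies to far more general
and more interesting sequences $\mathbf{\Delta}$, including some that allow periodic large jumps
(i.e., $\Delta_t = 1$ for some indices $t$), others where the sequence $\Delta_t$ converges to 0, and
so on. Note, however, that the algorithm obtaining this bound directly depends
on the sequence $\mathbf{\Delta}$. One of the contributions of the present work is to remove this
requirement, while maintaining essentially the same bound, though in a slightly
different form.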
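
To make the window-size trade-off in (1) concrete, here is a minimal Python sketch that evaluates the objective inside the minimum for every window size and returns a minimizer. The function names and the list layout of `deltas` are assumptions made for illustration, and the sketch ignores the constant factors hidden in the $\Theta(\cdot)$ and $O(\cdot)$ notation.

```python
import math

def Log(x):
    """Truncated logarithm Log(x) = ln(max{x, e})."""
    return math.log(max(x, math.e))

def window_objectives(deltas, t, d):
    """Objective inside the min of bound (1), for every window size m = 1..t-1.

    deltas[j] holds Delta_j for j = 2..t (entries 0 and 1 are unused padding).
    """
    objectives = {}
    suffix = 0.0  # sum_{j=i+1}^{t} Delta_j for the current i = t - m
    total = 0.0   # sum_{i=t-m}^{t-1} of those suffix sums
    for m in range(1, t):
        suffix += deltas[t - m + 1]
        total += suffix
        objectives[m] = total / m + d * Log(m / d) / m
    return objectives

def best_window(deltas, t, d):
    """Window size minimizing the objective (ties broken toward smaller m)."""
    objectives = window_objectives(deltas, t, d)
    return min(objectives, key=objectives.get)
```

With no drift the objective keeps shrinking as the window grows, so the whole history is used; with a constant positive drift the first term grows linearly in $m$ and the minimizer moves to an intermediate window size.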

Footnote: They in fact prove a more general result, which also applies to methods
approximately minimizing the number of mistakes, but for simplicity we will only discuss
this basic version of the result.
3.3 Computational Efficiency

[HL94] also proposed a reduction-based approach, which sometimes yields
computationally efficient methods, though the tolerable $\Delta$ value is smaller. Specifically,
given any (randomized) polynomial-time algorithm $\mathcal{A}$ that produces a classifier
$h \in \mathbb{C}$ with $\sum_{t=1}^{m} \mathbb{1}[h(x_t) \neq y_t] = 0$ for any sequence $(x_1, y_1), \ldots, (x_m, y_m)$
for which such a classifier $h$ exists (called the consistency problem), they propose
a polynomial-time algorithm that is $(\epsilon, S_\Delta)$-tracking for all values of $\Delta \leq \Delta'_\epsilon$,
where $\Delta'_\epsilon = \Theta\left(\frac{\epsilon^2}{d^2\,\mathrm{Log}(1/\epsilon)}\right)$. This is slightly worse (by a factor of $d\,\mathrm{Log}(1/\epsilon)$)
than the drift rate tolerable by the (typically inefficient) algorithm mentioned
above. However, it does sometimes yield computationally-efficient methods. For
instance, there are known polynomial-time algorithms for the consistency problem
for the classes of linear separators, conjunctions, and axis-aligned rectangles.


3.4 Lower Bounds 

[HL94] additionally prove lower bounds for specific concept spaces: namely, linear
separators and axis-aligned rectangles. They specifically argue that, for $\mathbb{C}$ a
concept space

$$\mathrm{BASIC}_n = \left\{ \bigcup_{i=1}^{n} [i/n, (i + a_i)/n) : a \in [0,1]^n \right\}$$

on $[0,1]$, under $\mathcal{P}$ the uniform distribution on $[0,1]$, for any $\epsilon \in [0, 1/e^2]$ and
$\Delta_\epsilon \geq e^4\epsilon^2/n$, for any algorithm $\mathcal{A}$, and any $T \in \mathbb{N}$, there exists a choice
of $\mathbf{h}^* \in S_{\Delta_\epsilon}$ such that the prediction $\hat{Y}_T$ produced by $\mathcal{A}$ at time $T$ satisfies
$\mathbb{P}(\hat{Y}_T \neq Y_T) > \epsilon$. Based on this, they conclude that no $(\epsilon, S_{\Delta_\epsilon})$-tracking
algorithm exists. Furthermore, they observe that the space $\mathrm{BASIC}_n$ is embeddable
in many commonly-studied concept spaces, including halfspaces and axis-aligned
rectangles in $\mathbb{R}^n$, so that for $\mathbb{C}$ equal to either of these spaces, there also is no
$(\epsilon, S_{\Delta_\epsilon})$-tracking algorithm.


4 Adapting to Arbitrarily Varying Drift Rates 

This section presents a general bound on the error rate at each time, expressed
as a function of the rates of drift, which are allowed to be arbitrary. Most
importantly, in contrast to the methods from the literature discussed above, the
method achieving this general result is adaptive to the drift rates, so that it
requires no information about the drift rates in advance. This is an appealing
property, as it essentially allows the algorithm to learn under an arbitrary
sequence $\mathbf{h}^*$ of target concepts; the difficulty of the task is then simply reflected
in the resulting bounds on the error rates: that is, faster-changing sequences of
target functions result in larger bounds on the error rates, but do not require a
change in the algorithm itself.









Steve Hanneke, Varun Kanade, and Liu Yang 


4.1 Adapting to a Changing Drift Rate 

Recall that the method yielding (1) (based on the work of [HL94]) required
access to the sequence $\mathbf{\Delta}$ of changes to achieve the stated guarantee on the
expected number of mistakes. That method is based on choosing a classifier
to predict $\hat{Y}_t$ by minimizing the number of mistakes among the previous $m_t$
samples, where $m_t$ is a value chosen based on the $\mathbf{\Delta}$ sequence. Thus, the key to
modifying this algorithm to make it adaptive to the $\mathbf{\Delta}$ sequence is to determine
a suitable choice of $m_t$ without reference to the $\mathbf{\Delta}$ sequence. The strategy we
adopt here is to use the data to determine an appropriate value $\hat{m}_t$ to use.
Roughly (ignoring logarithmic factors for now), the insight that enables us to
achieve this feat is that, for the $m_t$ used in the above strategy, one can show that
$\sum_{i=t-m_t}^{t-1} \mathbb{1}[h^*_t(X_i) \neq Y_i]$ is roughly $\tilde{O}(d)$, and that making the prediction $\hat{Y}_t$ with
any $h \in \mathbb{C}$ with roughly $\tilde{O}(d)$ mistakes on these samples will suffice to obtain the
stated bound on the error rate (up to logarithmic factors). Thus, if we replace
$m_t$ with the largest value $m$ for which $\min_{h \in \mathbb{C}} \sum_{i=t-m}^{t-1} \mathbb{1}[h(X_i) \neq Y_i]$ is roughly
$\tilde{O}(d)$, then the above observation implies $m \geq m_t$. This then implies that, for
$\hat{h} = \mathrm{argmin}_{h \in \mathbb{C}} \sum_{i=t-m}^{t-1} \mathbb{1}[h(X_i) \neq Y_i]$, we have that $\sum_{i=t-m_t}^{t-1} \mathbb{1}[\hat{h}(X_i) \neq Y_i]$ is
also roughly $\tilde{O}(d)$, so that the stated bound on the error rate will be achieved
(aside from logarithmic factors) by choosing $\hat{h}_t$ as this classifier $\hat{h}$. There are
a few technical modifications to this argument needed to get the logarithmic
factors to work out properly, and for this reason the actual algorithm and proof
below are somewhat more involved. Specifically, consider the following algorithm
(the value of the universal constant $K \geq 1$ will be specified below).


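0. For $T = 1, 2, \ldots$

1. Let $\hat{m}_T = \max\left\{ m \in \{1, \ldots, T-1\} : \min_{h \in \mathbb{C}} \max_{m' \leq m} \frac{\sum_{t=T-m'}^{T-1} \mathbb{1}[h(X_t) \neq Y_t]}{d\,\mathrm{Log}(m'/d) + \mathrm{Log}(1/\delta)} < K \right\}$

2. Let $\hat{h}_T = \mathrm{argmin}_{h \in \mathbb{C}} \max_{m' \leq \hat{m}_T} \frac{\sum_{t=T-m'}^{T-1} \mathbb{1}[h(X_t) \neq Y_t]}{d\,\mathrm{Log}(m'/d) + \mathrm{Log}(1/\delta)}$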


Note that the classifiers $\hat{h}_t$ chosen by this algorithm have no dependence on
$\mathbf{\Delta}$, or indeed anything other than the data $\{(X_i, Y_i) : i < t\}$, and the concept
space $\mathbb{C}$.
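
The two steps of the adaptive rule can be sketched in Python for the special case of a finite list of candidate classifiers. This is only an illustration under stated assumptions: the function name `adaptive_choice`, the representation of hypotheses as callables, and the numeric value passed for `K` are all hypothetical (the analysis fixes $K$ in terms of a universal constant that is not given numerically), and exhaustive search over a finite list stands in for optimization over $\mathbb{C}$.

```python
import math

def Log(x):
    """Truncated logarithm Log(x) = ln(max{x, e})."""
    return math.log(max(x, math.e))

def adaptive_choice(hypotheses, xs, ys, d, delta, K):
    """One round T of the adaptive rule, with xs, ys the T-1 past examples (oldest first).

    Returns (m_hat, h_hat). Assumes K > 1, so that m = 1 always qualifies.
    """
    n = len(xs)  # n = T - 1
    if n == 0:
        return 0, hypotheses[0]  # no data yet: any classifier will do
    running = []  # running[k][m-1] = max over m' <= m of the ratio for hypothesis k
    for h in hypotheses:
        mistakes, worst, row = 0, 0.0, []
        for m in range(1, n + 1):
            x, y = xs[n - m], ys[n - m]  # the m-th most recent example
            mistakes += (h(x) != y)
            worst = max(worst, mistakes / (d * Log(m / d) + Log(1.0 / delta)))
            row.append(worst)
        running.append(row)
    # Step 1: largest window on which some hypothesis keeps every ratio below K.
    m_hat = max(m for m in range(1, n + 1)
                if min(row[m - 1] for row in running) < K)
    # Step 2: hypothesis minimizing the worst ratio over windows up to m_hat.
    best = min(range(len(hypotheses)), key=lambda k: running[k][m_hat - 1])
    return m_hat, hypotheses[best]
```

On a stationary sequence some hypothesis has no mistakes, so the window grows to the full history; after an abrupt change, every hypothesis eventually accumulates too many mistakes on older data and the window stops growing.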

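Theorem 1. Fix any $\delta \in (0,1)$, and let $\mathcal{A}$ be the above algorithm. For any
sequence $\mathbf{\Delta}$ in $[0,1]$, for any $\mathcal{P}$ and any choice of $\mathbf{h}^* \in S_{\mathbf{\Delta}}$, for every $T \in \mathbb{N} \setminus \{1\}$,
with probability at least $1 - \delta$,

$$\mathrm{er}_T(\hat{h}_T) \leq O\left( \min_{1 \leq m \leq T-1} \frac{1}{m} \sum_{i=T-m}^{T-1} \sum_{j=i+1}^{T} \Delta_j + \frac{d\,\mathrm{Log}(m/d) + \mathrm{Log}(1/\delta)}{m} \right).$$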




Before presenting the proof of this result, we first state a crucial lemma, which
follows immediately from a classic result of [Vap82, Vap98], combined with the
fact (from [Vid03], Theorem 4.5) that the VC dimension of the collection of sets
$\{\{x : h(x) \neq g(x)\} : h, g \in \mathbb{C}\}$ is at most $10d$.
Lemma 1. There exists a universal constant $c \in [1, \infty)$ such that, for any class
$\mathbb{C}$ of VC dimension $d$, $\forall m \in \mathbb{N}$, $\forall \delta \in (0,1)$, with probability at least $1 - \delta$, every
$h, g \in \mathbb{C}$ have

$$\left| \mathcal{P}(x : h(x) \neq g(x)) - \frac{1}{m} \sum_{t=1}^{m} \mathbb{1}[h(X_t) \neq g(X_t)] \right| \leq c \sqrt{ \left( \frac{1}{m} \sum_{t=1}^{m} \mathbb{1}[h(X_t) \neq g(X_t)] \right) \frac{d\,\mathrm{Log}(m/d) + \mathrm{Log}(1/\delta)}{m} } + c\, \frac{d\,\mathrm{Log}(m/d) + \mathrm{Log}(1/\delta)}{m}.$$

We are now ready for the proof of Theorem 1. For the constant $K$ in the
algorithm, we will choose $K = 145c^2$, for $c$ as in Lemma 1.

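Proof (Proof of Theorem 1). Fix any $T \in \mathbb{N}$ with $T \geq 2$, and define

$$m^*_T = \max\left\{ m \in \{1, \ldots, T-1\} : \forall m' \leq m,\ \sum_{t=T-m'}^{T-1} \mathbb{1}[h^*_T(X_t) \neq Y_t] < K(d\,\mathrm{Log}(m'/d) + \mathrm{Log}(1/\delta)) \right\}.$$

Note that

$$\sum_{t=T-m^*_T}^{T-1} \mathbb{1}[h^*_T(X_t) \neq Y_t] \leq K(d\,\mathrm{Log}(m^*_T/d) + \mathrm{Log}(1/\delta)), \quad (2)$$

and also note that (since $h^*_T \in \mathbb{C}$) $\hat{m}_T \geq m^*_T$, so that (by definition of $\hat{m}_T$ and
$\hat{h}_T$)

$$\sum_{t=T-m^*_T}^{T-1} \mathbb{1}[\hat{h}_T(X_t) \neq Y_t] \leq K(d\,\mathrm{Log}(m^*_T/d) + \mathrm{Log}(1/\delta))$$

as well. Therefore,

$$\sum_{t=T-m^*_T}^{T-1} \mathbb{1}[h^*_T(X_t) \neq \hat{h}_T(X_t)] \leq \sum_{t=T-m^*_T}^{T-1} \mathbb{1}[h^*_T(X_t) \neq Y_t] + \sum_{t=T-m^*_T}^{T-1} \mathbb{1}[Y_t \neq \hat{h}_T(X_t)] \leq 2K(d\,\mathrm{Log}(m^*_T/d) + \mathrm{Log}(1/\delta)).$$

Thus, by Lemma 1, for each $m \in \mathbb{N}$, with probability at least $1 - \delta/(6m^2)$, if
$m^*_T = m$, then

$$\mathcal{P}(x : \hat{h}_T(x) \neq h^*_T(x)) \leq (2K + c\sqrt{2K} + c)\, \frac{d\,\mathrm{Log}(m^*_T/d) + \mathrm{Log}(6(m^*_T)^2/\delta)}{m^*_T}.$$

Furthermore, since $\mathrm{Log}(6(m^*_T)^2) \leq \sqrt{2K}\, d\,\mathrm{Log}(m^*_T/d)$, this is at most

$$2(K + c\sqrt{2K})\, \frac{d\,\mathrm{Log}(m^*_T/d) + \mathrm{Log}(1/\delta)}{m^*_T}.$$

By a union bound (over values $m \in \mathbb{N}$), we have that with probability at least
$1 - \sum_{m=1}^{\infty} \delta/(6m^2) \geq 1 - \delta/3$,

$$\mathcal{P}(x : \hat{h}_T(x) \neq h^*_T(x)) \leq 2(K + c\sqrt{2K})\, \frac{d\,\mathrm{Log}(m^*_T/d) + \mathrm{Log}(1/\delta)}{m^*_T}.$$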

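Let us denote

$$\tilde{m}_T = \mathop{\mathrm{argmin}}_{m \in \{1, \ldots, T-1\}} \frac{1}{m} \sum_{i=T-m}^{T-1} \sum_{j=i+1}^{T} \Delta_j + \frac{d\,\mathrm{Log}(m/d) + \mathrm{Log}(1/\delta)}{m}.$$

Note that, for any $m' \in \{1, \ldots, T-1\}$ and $\delta \in (0,1)$, if $\tilde{m}_T \geq m'$, then

$$\min_{m \in \{1, \ldots, T-1\}} \frac{1}{m} \sum_{i=T-m}^{T-1} \sum_{j=i+1}^{T} \Delta_j + \frac{d\,\mathrm{Log}(m/d) + \mathrm{Log}(1/\delta)}{m} \geq \min_{m \in \{m', \ldots, T-1\}} \frac{1}{m} \sum_{i=T-m}^{T-1} \sum_{j=i+1}^{T} \Delta_j = \frac{1}{m'} \sum_{i=T-m'}^{T-1} \sum_{j=i+1}^{T} \Delta_j,$$

while if $\tilde{m}_T \leq m'$, then

$$\min_{m \in \{1, \ldots, T-1\}} \frac{1}{m} \sum_{i=T-m}^{T-1} \sum_{j=i+1}^{T} \Delta_j + \frac{d\,\mathrm{Log}(m/d) + \mathrm{Log}(1/\delta)}{m} \geq \min_{m \in \{1, \ldots, m'\}} \frac{d\,\mathrm{Log}(m/d) + \mathrm{Log}(1/\delta)}{m} = \frac{d\,\mathrm{Log}(m'/d) + \mathrm{Log}(1/\delta)}{m'}.$$

Either way, we have that

$$\min_{m \in \{1, \ldots, T-1\}} \frac{1}{m} \sum_{i=T-m}^{T-1} \sum_{j=i+1}^{T} \Delta_j + \frac{d\,\mathrm{Log}(m/d) + \mathrm{Log}(1/\delta)}{m} \geq \min\left\{ \frac{d\,\mathrm{Log}(m'/d) + \mathrm{Log}(1/\delta)}{m'},\ \frac{1}{m'} \sum_{i=T-m'}^{T-1} \sum_{j=i+1}^{T} \Delta_j \right\}. \quad (3)$$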


For any $m \in \{1, \ldots, T-1\}$, applying Bernstein's inequality (see [BLM13],
equation 2.10) to the random variables $\mathbb{1}[h^*_T(X_i) \neq Y_i]/d$, $i \in \{T-m, \ldots, T-1\}$,
and again to the random variables $-\mathbb{1}[h^*_T(X_i) \neq Y_i]/d$, $i \in \{T-m, \ldots, T-1\}$,
together with a union bound, we obtain that, for any $\delta \in (0,1)$, with probability
at least $1 - \delta/(3m^2)$,

T-l 


^ ^ Ki^)) 


i—T—m 


\ 


T-l 


■ hj,{x) ^ h^ (x)) I — 


i—T—m 


T-l 


<- E 


i=^T—n 

T-l 


i—T—r 


<- E 'P{x-.hUx)^h*{x)) 

\j (m T,i=T-m Y’{X ■ Kp{x) ^ h* (x))^ 


4 ln.( 3 m 2 /( 5 ) 


max ■ 


( 4 / 3 ) ln( 3 m^/( 5 ) 
m 

The left inequality implies that 


( 4 ) 


— V{x:h^{x) ^ h*{x)) < max-j — Yj], A) y 

m m m 

i—T—m K i—T—m ) 

Plugging this into the right inequality in (|3]), we obtain that 

T-l T-l 

- y] Y. nx:h*Ax)^h*ix)) 


i—T—r 


i—T—r 


■ max ■ 


\ 


T-l 




81n(3TO^/(5) y/^\n.{3m'^/6) 


i—T—m 


By a union bound, this holds simultaneously for all m G {!,...,T — 1} with 
probability at least 1 — X]m=i i5/(3to^) > 1 — {2/3)S. Note that, on this event, 
we obtain 

T-l T-l 

- y] Vix:h*Ax)^h*ix))>- Y nhUX^)y^Y^ 

i—T—m i—T—m 


— max ■ 




- V 1 [I.MV.) 7 ^r.l) I ^ 

m . I m m \ 


i—T—m 


In particular, taking m = m^, and invoking maximality of m^, if rrYp < T — 1, 
the right hand side is at least 


{K - 6c-,T) ^Log(m^/d) + Log(l/(5) ^ 
Since $\frac{1}{m} \sum_{i=T-m}^{T-1} \sum_{j=i+1}^{T} \Delta_j \geq \frac{1}{m} \sum_{i=T-m}^{T-1} \mathcal{P}(x : h^*_T(x) \neq h^*_i(x))$, taking

K = 145c^, we have that with probability at least 1 — i5, if < T — 1, then 

m{K + cVm) min - E E + + 

i—T—m 


> 10(A' + OV®) min I m + 

> 10(A + ^)<iLc.g(m}/d) + L<,g(l/^) 


T-1 T ') 

’ ^ E A [ 


> V{x : hrix) ^ h^{x)). 


Furthermore, if = T — 1, then we trivially have (on the same 1 — <5 probability 
event as above) 

10(i^ + cvE?F) min IV E 

i—T—m 

>m(K + csm) min <'I-»g(m/<i) + Log(l/g) 

me{i,...,T-i} m 

= 10(A + ~ 

= 10(A' + ^^ ‘a^A^"‘‘TM + L»AUS) ^ ^ ^ 

□ 


4.2 Conditions Guaranteeing a Sublinear Number of Mistakes 

One immediate implication of Theorem 1 is that, if the sum of $\Delta_t$ values grows
sublinearly, then there exists an algorithm achieving an expected number of
mistakes growing sublinearly in the number of predictions. Formally, we have
the following corollary.

Corollary 1. If $\sum_{t=1}^{T} \Delta_t = o(T)$, then there exists an algorithm $\mathcal{A}$ such that,
for every $\mathcal{P}$ and every choice of $\mathbf{h}^* \in S_{\mathbf{\Delta}}$,

$$\mathbb{E}\left[ \sum_{t=1}^{T} \mathbb{1}[\hat{Y}_t \neq Y_t] \right] = o(T).$$

Proof. For every T gN with T > 2, let 


rriT = 


argmin 


1 


E 


E + 


^^Log(r7^/f^) + Logil/Sr) 


m 


m 
and define $\delta_T$ = ■. Then consider running the algorithm $\mathcal{A}$ from Theorem 1,
except that in choosing $\hat{m}_T$ and $\hat{h}_T$ for each $T$, we use the above value $\delta_T$ in
place of $\delta$. Then Theorem 1 implies that, for each $T$, with probability at least
$1 - \delta_T$,


T-l T 


eixihx) < O min — 

' l<m<T-l m 


E Ea 

i—T—m j—i+1 


dLog{m/d) +Log(l/i5T)\ 


Since $\mathrm{er}_T(\hat{h}_T) \leq 1$, this implies that


{Yt Yx 


= E 


< O I min 

l<m<T-l m 


erx{hx) 

^ ^ dLog(TO/(i) + Log(l/i5T)\ 

m I 


- E E 

1 m ^ ^ 


5x 


= O I min 

l<m<T-l m 


i—T—m 
T-l T 


- E E 

1 m ^ ^ 

i—T—m j—i-\-l 


dLog{m/d) + Log(TO) 


and since $x \mapsto x\,\mathrm{Log}(m/x)$ is nondecreasing for $x \geq 1$, $\mathrm{Log}(m) \leq d\,\mathrm{Log}(m/d)$, so
that this last expression is


O 


T-l 


mm 
l<m<T-l m 


E EA 

i—T—m j—i-\-l 


dLog{m/d) 


Now note that, for any $t \in \mathbb{N}$ and $m \in \{1, \ldots, t-1\}$,


^ t— ± L ^ t— ± L L 

~ E E ~ E E = E 

s—t—mr—s-\-l s—t — mr—t—m-\-l r—t—m-\-l 

Let (3tim) = max jxiE-m+i and note that J2l=t-m+i^r + 

^ 2f3t{m). Thus, combining the above with @, linearity of expecta¬ 
tions, and the fact that the probability of a mistake on a given round is at most 
1, we obtain 


E 




= O ( min Ptim) A 1 
Fixing any $M \in \mathbb{N}$, we have that for any $T > M$,

T T 

min Mm) A 1 < M + (3t{M) A 1 


'dLog{M/d) 


M 

t 


t=M+l 

> jz M 

r—t—M+1 


dLog{M/d) 


M 


T 

< M + 1 

i^M+l 
T 

+ E 1 

i^M+l 

dLoglM/d)^ ^ M ^ 

- M ^ dLog(M/d) ^ 

dLog(M/d) 

--— T+SMin 


a ^ dLog(M/d) 
^ ^ M 

.r=i —M+1 


where $g_M$ is a function satisfying $g_M(T) = o(T)$ (holding $M$ fixed). Since this is
true of any $M \in \mathbb{N}$, we have that

r ^ • o / ^ A 1 ^ 1- 1- dLog(M/d) gmir) 

hm — > mm pdm) A 1 < hm lim-1- 

T^cso E 1} M^ooT^oo ]\4 T 


= li^ dLog{M/d) 


so that E 



Yty^Yt 


o{T), as claimed. 


□ 


For many concept spaces of interest, the condition $\sum_{t=1}^{T} \Delta_t = o(T)$ in
Corollary 1 is also a necessary condition for any algorithm to guarantee a sublinear
number of mistakes. For simplicity, we will establish this for the class of
homogeneous linear separators on $\mathbb{R}^2$, with $\mathcal{P}$ the uniform distribution on the unit circle,
in the following theorem. This can easily be extended to many other spaces,
including higher-dimensional linear separators or axis-aligned rectangles in $\mathbb{R}^k$, by
embedding an analogous setup into those spaces.


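Theorem 2. If $\mathcal{X} = \{x \in \mathbb{R}^2 : \|x\| = 1\}$, $\mathcal{P}$ is Uniform($\mathcal{X}$), and
$\mathbb{C} = \{x \mapsto 2\mathbb{1}[w \cdot x \geq 0] - 1 : w \in \mathbb{R}^2, \|w\| = 1\}$ is the class of homogeneous linear
separators, then for any sequence $\mathbf{\Delta}$ in $[0,1]$, there exists an algorithm $\mathcal{A}$ such
that $\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}[\hat{Y}_t \neq Y_t]\right] = o(T)$ for every choice of $\mathbf{h}^* \in S_{\mathbf{\Delta}}$ if and only if
$\sum_{t=1}^{T} \Delta_t = o(T)$.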

Proof. The "if" part follows immediately from Corollary 1. For the "only if"
part, suppose $\mathbf{\Delta}$ is such that $\sum_{t=1}^{T} \Delta_t \neq o(T)$. It suffices to argue that for any
algorithm $\mathcal{A}$, there exists a choice of $\mathbf{h}^* \in S_{\mathbf{\Delta}}$ for which $\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}[\hat{Y}_t \neq Y_t]\right] \neq o(T)$.
Toward this end, fix any algorithm $\mathcal{A}$. We proceed by the probabilistic
method, constructing a random sequence $\mathbf{h}^* \in S_{\mathbf{\Delta}}$. Let $B_1, B_2, \ldots$ be independent
Bernoulli(1/2) random variables, encoded as signs in $\{-1, +1\}$ (also independent from the
unlabeled data $X_1, X_2, \ldots$). We define the sequence $\mathbf{h}^*$ inductively. For simplicity, we will
represent each classifier in polar coordinates, writing $h_\phi$ (for $\phi \in \mathbb{R}$) to denote the
classifier that, for $x = (x_1, x_2)$, classifies $x$ as $h_\phi(x) = 2\mathbb{1}[x_1\cos(\phi) + x_2\sin(\phi) \geq 0] - 1$;
note that $h_\phi = h_{\phi + 2\pi}$ for every $\phi \in \mathbb{R}$. As a base case, start by defining
a function $h^*_0 = h_0$, and letting $\phi_0 = 0$. Now for any $t \in \mathbb{N}$, supposing $h^*_{t-1}$
is already defined to be $h_{\phi_{t-1}}$, we define $\phi_t = \phi_{t-1} + \min\{\Delta_t, 1/2\}\pi B_t$, and
$h^*_t = h_{\phi_t}$. Note that $\mathcal{P}(x : h^*_t(x) \neq h^*_{t-1}(x)) = \min\{\Delta_t, 1/2\}$ for every $t \in \mathbb{N}$, so
that this inductively defines a (random) choice of $\mathbf{h}^* \in S_{\mathbf{\Delta}}$.
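
The construction can be checked numerically with a short Python sketch. It assumes each $B_t$ is read as a uniform random sign in $\{-1, +1\}$, which is what the two rotation directions considered in the argument suggest; the function names are hypothetical, and evenly spaced points on the circle stand in for samples from the uniform distribution. Rotating the separator by an angle $\alpha \leq \pi/2$ flips the label on two wedges of total angle $2\alpha$, so the disagreement probability is $\alpha/\pi$.

```python
import math
import random

def h(phi, x):
    """Homogeneous separator h_phi(x) = 2*1[x1 cos(phi) + x2 sin(phi) >= 0] - 1."""
    return 1 if x[0] * math.cos(phi) + x[1] * math.sin(phi) >= 0 else -1

def disagreement(phi_a, phi_b, n=100000):
    """Fraction of n evenly spaced points on the unit circle where the two separators differ."""
    differ = 0
    for k in range(n):
        theta = 2 * math.pi * (k + 0.5) / n
        x = (math.cos(theta), math.sin(theta))
        differ += h(phi_a, x) != h(phi_b, x)
    return differ / n

def random_drift(deltas, rng):
    """Angles phi_0 = 0, phi_t = phi_{t-1} + min(Delta_t, 1/2) * pi * B_t, with B_t a random sign."""
    phis = [0.0]
    for delta_t in deltas:
        phis.append(phis[-1] + min(delta_t, 0.5) * math.pi * rng.choice((-1, 1)))
    return phis
```

Calling `disagreement(phis[t - 1], phis[t])` on consecutive angles should return a value close to $\min\{\Delta_t, 1/2\}$, in line with the note above.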

For each $t \in \mathbb{N}$, let $Y_t = h^*_t(X_t)$. Now fix any algorithm $\mathcal{A}$, and consider the
sequence $\hat{Y}_t$ of predictions the algorithm makes for points $X_t$, when the target
sequence $\mathbf{h}^*$ is chosen as above. Then note that, for any $t \in \mathbb{N}$, since $\hat{Y}_t$ and $B_t$
are independent,


{Yt ^Yt)>¥. [P {ft ^ Yt\Yt, 4>t-i) 


> E 

> E 


,l/2}'7r ^ min{Z\t,l/2}7r 


Furthermore, since min{Z\t, l/ 2 } 7 r < 7 r/ 2 , the regions {x : /i 0 t_i+min{/it,i/ 2 }ir(a;) f 
^ 0 t_i(a;)} and {x : /i 0 t_i-min{zit.i/ 2 } 7 r(a;) f (x)} have zero-probability over¬ 
lap (indeed, are disjoint if A; < 1 / 2 ), the above equals min{At, 1 / 2 }. 

By Fatou’s lemma, linearity of expectations, and the law of total expectation, 
we have that 


E 


lim sup —E 

T —^C30 Y 


Y.l[YtfYt] 


.t^l 


> 


1 

lim sup - V P ( 1 / Yt\ 


T 


> lim sup — ^min{At,l/ 2 }. 


T-s-c 


t=l 


Since $\sum_{t=1}^{T} \Delta_t \neq o(T)$, the rightmost expression is strictly greater than zero.
Thus, it must be that, with probability strictly greater than 0,


lim sup — E 

T —¥oo ^ 


r T 


Y.^[YtfYt] 

.t=l 


h* 


> 0 . 


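In particular, this implies that there exists a (nonrandom) choice of the sequence
$\mathbf{h}^* \in S_{\mathbf{\Delta}}$ for which $\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}[\hat{Y}_t \neq Y_t]\right] \neq o(T)$. Since this holds for any choice
of the algorithm $\mathcal{A}$, this completes the proof.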


5 Polynomial-Time Algorithms for Linear Separators 

In this section, we suppose $\Delta_t = \Delta$ for every $t \in \mathbb{N}$, for a fixed constant $\Delta > 0$,
and we consider the special case of learning homogeneous linear separators in
$\mathbb{R}^d$ under a uniform distribution on the origin-centered unit sphere. In this case,
the analysis of [HL94] mentioned in Section 3.3 implies that it is possible to
achieve a bound on the error rate that is $\tilde{O}(d\sqrt{\Delta})$, using an algorithm that runs
in time poly$(d, 1/\Delta, \log(1/\delta))$ (and independent of $t$) for each prediction. This
also implies that it is possible to achieve expected number of mistakes among $T$
predictions that is $\tilde{O}(d\sqrt{\Delta}) \times T$. [CMEDV10] have since proven that a variant of
the Perceptron algorithm is capable of achieving an expected number of mistakes
$\tilde{O}((d\Delta)^{1/4}) \times T$.

Below, we improve on this result by showing that there exists an efficient
algorithm that achieves a bound on the error rate that is $\tilde{O}(\sqrt{d\Delta})$, as was possible
with the inefficient algorithm of [HL94, Lon99] mentioned in Section 3.1. This
leads to a bound on the expected number of mistakes that is $\tilde{O}(\sqrt{d\Delta}) \times T$.
Furthermore, our approach also allows us to present the method as an active
learning algorithm, and to bound the expected number of queries, as a function
of the number of samples $T$, by $\tilde{O}(\sqrt{d\Delta}) \times T$. The technique is based on a
modification of the algorithm of [HL94], replacing an empirical risk minimization
step with (a modification of) the computationally-efficient algorithm of [ABL13].

Formally, define the class of homogeneous linear separators as the set of
classifiers $h_w : \mathbb{R}^d \to \{-1, +1\}$, for $w \in \mathbb{R}^d$ with $\|w\| = 1$, such that
$h_w(x) = \mathrm{sign}(w \cdot x)$ for every $x \in \mathbb{R}^d$.


5.1 An Improved Guarantee for a Polynomial-Time Algorithm 

We have the following result. 

Theorem 3. When $\mathbb{C}$ is the space of homogeneous linear separators (with $d \geq 4$)
and $\mathcal{P}$ is the uniform distribution on the surface of the origin-centered unit sphere
in $\mathbb{R}^d$, for any fixed $\Delta > 0$, for any $\delta \in (0, 1/e)$, there is an algorithm that runs
in time poly$(d, 1/\Delta, \log(1/\delta))$ for each time $t$, such that for any $\mathbf{h}^* \in S_\Delta$, for
every sufficiently large $t \in \mathbb{N}$, with probability at least $1 - \delta$,

$$\mathrm{er}_t(\hat{h}_t) = O\left( \sqrt{\Delta d \log(1/\delta)} \right).$$

Also, running this algorithm with $\delta = \sqrt{\Delta d} \wedge 1/e$, the expected number of mistakes
among the first $T$ instances is $O\left( \sqrt{\Delta d \log\left(\frac{1}{\Delta d}\right)}\, T \right)$. Furthermore, the algorithm
can be run as an active learning algorithm, in which case, for this choice of $\delta$, the
expected number of labels requested by the algorithm among the first $T$ instances
is $O\left( \sqrt{\Delta d}\, \log^{3/2}\left(\frac{1}{\Delta d}\right) T \right)$.

Footnote: This work in fact studies a much broader model of drift, which allows the
distribution $\mathcal{P}$ to vary with time as well. However, this $\tilde{O}((d\Delta)^{1/4}) \times T$ result can
be obtained from their more-general theorem by calculating the various parameters
for this particular setting.
We first state the algorithm used to obtain this result. It is primarily based
on a margin-based learning strategy of [ABL13], combined with an initialization
step based on a modified Perceptron rule from [DKM09, CMEDV10]. For $\tau > 0$
and $x \in \mathbb{R}$, define $\ell_\tau(x) = \max\{0, 1 - x/\tau\}$. Consider the following algorithm and
subroutine; parameters $\delta_k$, $m_k$, $\tau_k$, $r_k$, $b_k$, $\alpha$, and $\kappa$ will all be specified in the
context of the proof; we suppose $M = \sum_{k=0}^{\lceil \log_2(1/\alpha) \rceil} m_k$.

Algorithm: DriftingHalfspaces

0. Let $\hat{h}_0$ be an arbitrary classifier in $\mathbb{C}$

1. For $i = 1, 2, \ldots$

2. $\hat{h}_i \leftarrow$ ABL($M(i-1)$, $\hat{h}_{i-1}$)


Subroutine: ModPerceptron($t$, $h$)

0. Let $w_t$ be any element of $\mathbb{R}^d$ with $\|w_t\| = 1$

1. For $m = t+1, t+2, \ldots, t+m_0$

2. Choose $\hat{h}_m = h$ (i.e., predict $\hat{Y}_m = h(X_m)$ as the prediction for $Y_m$)

3. Request the label $Y_m$

4. If $h_{w_{m-1}}(X_m) \neq Y_m$

5. $w_m \leftarrow w_{m-1} - 2(w_{m-1} \cdot X_m)X_m$

6. Else $w_m \leftarrow w_{m-1}$

7. Return $w_{t+m_0}$
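
The update in steps 4 to 6 can be illustrated with a minimal Python sketch of a single round. The sketch assumes the update is the reflection $w \leftarrow w - 2(w \cdot x)x$ of the modified Perceptron of [DKM09] and that $x$ has unit norm; the function names are hypothetical. Under these assumptions the update preserves $\|w\| = 1$ and flips the sign of $w \cdot x$.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mod_perceptron_step(w, x, y):
    """One round: if sign(w . x) != y, reflect w <- w - 2 (w . x) x; otherwise keep w.

    Assumes x is a unit vector, as for points on the unit sphere.
    """
    prediction = 1 if dot(w, x) >= 0 else -1
    if prediction != y:
        s = dot(w, x)
        return [wi - 2.0 * s * xi for wi, xi in zip(w, x)]
    return list(w)
```

Because no step size appears, the norm of the weight vector never needs renormalizing, which is one reason this variant is convenient on the sphere.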


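Subroutine: ABL($t$, $h$)

0. Let $w_0$ be the return value of ModPerceptron($t$, $h$)

1. For $k = 1, 2, \ldots, \lceil \log_2(1/\alpha) \rceil$

2. $W_k \leftarrow \{\}$

3. For $s = t + \sum_{j=0}^{k-1} m_j + 1, \ldots, t + \sum_{j=0}^{k} m_j$

4. Choose $\hat{h}_s = h$ (i.e., predict $\hat{Y}_s = h(X_s)$ as the prediction for $Y_s$)

5. If $|w_{k-1} \cdot X_s| \leq b_{k-1}$, Request label $Y_s$ and let $W_k \leftarrow W_k \cup \{(X_s, Y_s)\}$

6. Find $v_k \in \mathbb{R}^d$ with $\|v_k - w_{k-1}\| \leq r_k$, $0 < \|v_k\| \leq 1$, and
$\sum_{(x,y) \in W_k} \ell_{\tau_k}(y(v_k \cdot x)) \leq \inf_{v : \|v - w_{k-1}\| \leq r_k} \sum_{(x,y) \in W_k} \ell_{\tau_k}(y(v \cdot x)) + \kappa|W_k|$

7. Let $w_k = \frac{1}{\|v_k\|} v_k$

8. Return $h_{w_{\lceil \log_2(1/\alpha) \rceil - 1}}$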

Before stating the proof, we have a few additional lemmas that will be needed.
The following result for ModPerceptron was proven by [CMEDV10].

Lemma 2. Suppose A < Consider the values Wm obtained during the ex¬ 
ecution o/ModPerceptron(t,/i). Vm G {< -I- 1, ..., t -I- mo}, V{x : hw^{x) ^ 
h^^(^x'j'j y— • Furthermore, letting c\ — ^ .400.215 , if 

V{x : hw„,_Jyx) h'^{x)) > 1/32, then with probability at least 1/64, V{x : 
bw,nix) ^ h*^{x)) < (1 - ci)V{x : hyj^_^{x) ^ h^{x)). 


This implies the following. 
Lemma 3. Suppose A < 4oo.2^ 7 (d+in(4/a)) ■ = niax{ |■l28(l/cl) ln(32)], 

[512 ln(|)]}, with probability at least 1 — (5/4, ModPerceptron(t,/i) returns a 
vector w with V{x : h^ix) ^ ^ 1/16. 

Proof. By Lemma [2] and a union bound, in general we have 

: hwmix) ^ Kn+ii.^)) < : hyj^_.^{x) ^ h*^{x)) + A. (6) 

Furthermore, if $\mathcal{P}(x : h_{w_{m-1}}(x) \neq h^*_m(x)) \geq 1/32$, then with probability at least
1/64, 

V{x : h^^{x) ^ h^+i(x)) < (1 - ci)V{x : h^^_^{x) ^ h*^{x)) + A. (7) 

In particular, this implies that the number JV of values m G {t + 1,... ,t + mo} 
with either P(x : hy,^_j^(x) h*^{x)) < 1/32 or V{x : h^^\x) /im+i(a)) < 

(1 — ci)V{x : hw„,_i{x) h^{x)) + Z\ is lower-bounded by a Binomial(m, 1/64) 

random variable. Thus, a Chernoff bound implies that with probability at least 
1 — exp{—mo/512} > 1 — (5/4, we have N > mo/128. Suppose this happens. 
Since Z\mo < 1/32, if any m G {t + 1,... ,t + mo} has V(x : hw.^_i(x) 

< 1/32, then inductively applying ([6]) implies that V{x : hwt+^^ix) 

< 1/32 + Amo < 1/16. On the other hand, if all m G {t + 1,..., t + 
mo} have V(x : (x) ^ /i^(x)) > 1/32, then in particular we have N values 

ofmG {t-l-1,..., id-mo} satisfying d?]). Combining this fact with ([6]) inductively, 
we have that 

'P{x : ^ < (1 - cifV{x : V{x) ^ K+^{x)) + Amo 

< (1 _ c^)(lM)ln(32)p(^ ^ ^ ^ + Amo < 

□ 
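The last chain of inequalities rests on two elementary facts: $(1 - c_1)^{(1/c_1)\ln(32)} \le 1/32$, and $m_0 \ge 512 \ln(4/\delta)$ gives $\exp\{-m_0/512\} \le \delta/4$. A small numeric check in Python is sketched below. The values of $c_1$ and $\delta$ are arbitrary illustrations (not the constant from Lemma 2), and the function names are hypothetical.

```python
import math


def m0(c1, delta):
    """m_0 = max{ceil(128 (1/c1) ln 32), ceil(512 ln(4/delta))}, as in Lemma 3."""
    return max(math.ceil(128 * (1.0 / c1) * math.log(32)),
               math.ceil(512 * math.log(4.0 / delta)))


def contraction(c1, n):
    """(1 - c1)^n: the factor remaining after n multiplicative improvements."""
    return (1.0 - c1) ** n


results = []
for c1, delta in [(0.5, 0.1), (0.01, 0.05), (1e-4, 0.3)]:
    m = m0(c1, delta)
    n_good = m / 128.0  # the lower bound on N used in the proof
    results.append((contraction(c1, n_good) <= 1 / 32,
                    math.exp(-m / 512.0) <= delta / 4))
```

Each pair in `results` records whether the contraction bound and the Chernoff failure bound hold for the chosen values.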

Next, we consider the execution of ABL$(t, \tilde{h})$, and let the sets $W_k$ be as in that execution. We will denote by $w^*$ the weight vector with $\|w^*\| = 1$ such that $h^*_{t+m_0+1} = h_{w^*}$. Also denote by $M_1 = M - m_0$.

The proof relies on a few results proven in the work of [ABL13], which we summarize in the following lemmas. Although the results were proven in a slightly different setting in that work (namely, agnostic learning under a fixed joint distribution), one can easily verify that their proofs remain valid in our present context as well.

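Lemma 4. [ABL13] Fix any $k \in \{1, \ldots, \lceil \log_2(1/\alpha) \rceil\}$. For a universal constant $c_7 > 0$, suppose $b_{k-1} = c_7 2^{1-k}/\sqrt{d}$, and let $z_k = \sqrt{r_k^2/(d-1) + b_{k-1}^2}$. For a universal constant $c_1 > 0$, if $\|w^* - w_{k-1}\| \le r_k$,

$$\left| \mathbb{E}\left[ \sum_{(x,y) \in W_k} \ell_{\tau_k}(|w^* \cdot x|) \,\Big|\, w_{k-1}, |W_k| \right] - \mathbb{E}\left[ \sum_{(x,y) \in W_k} \ell_{\tau_k}(y(w^* \cdot x)) \,\Big|\, w_{k-1}, |W_k| \right] \right| \le c_1 |W_k| \sqrt{2^k \Delta M_1} \frac{z_k}{\tau_k}.$$

Lemma 5. [BL13] For any $c > 0$, there is a constant $c' > 0$ depending only on $c$ (i.e., not depending on $d$) such that, for any $u, v \in \mathbb{R}^d$ with $\|u\| = \|v\| = 1$, letting $a = \mathcal{P}(x : h_u(x) \neq h_v(x))$, if $a < 1/2$, then

$$\mathcal{P}\left(x : h_u(x) \neq h_v(x) \text{ and } |v \cdot x| \ge c' \frac{a}{\sqrt{d}}\right) \le c a.$$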

The following is a well-known lemma concerning concentration around the equator for the uniform distribution (see e.g., [DKM09, BBZ07, ABL13]); for instance, it easily follows from the formulas for the area in a spherical cap derived by [Li11].


Lemma 6. For any constant $C > 0$, there are constants $c_2, c_3 > 0$ depending only on $C$ (i.e., independent of $d$) such that, for any $w \in \mathbb{R}^d$ with $\|w\| = 1$, $\forall \gamma \in [0, C/\sqrt{d}]$,

$$c_2 \gamma \sqrt{d} \le \mathcal{P}(x : |w \cdot x| \le \gamma) \le c_3 \gamma \sqrt{d}.$$
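The scaling in Lemma 6 can be observed empirically. The following Python sketch estimates $\mathcal{P}(x : |w \cdot x| \le \gamma)$ by Monte Carlo sampling from the uniform distribution on the unit sphere and reports the ratio to $\gamma\sqrt{d}$. The sampler (normalized Gaussian vectors), the sample size, and the function names are choices made here for illustration; the estimate says nothing about the actual values of $c_2$ and $c_3$.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Uniform point on the unit sphere in R^d via a normalized Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(a * a for a in g))
    return [a / n for a in g]


def band_mass(d, gamma, n_samples, rng):
    """Monte Carlo estimate of P(|w . x| <= gamma); by symmetry take w = e_1."""
    hits = 0
    for _ in range(n_samples):
        x = sample_unit_sphere(d, rng)
        if abs(x[0]) <= gamma:
            hits += 1
    return hits / n_samples


rng = random.Random(0)
d = 20
gamma = 0.5 / math.sqrt(d)
ratio = band_mass(d, gamma, 4000, rng) / (gamma * math.sqrt(d))
```

For $\gamma$ of order $1/\sqrt{d}$, `ratio` should stay bounded away from both 0 and infinity as $d$ varies, which is the content of the lemma.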

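Based on this lemma, [ABL13] prove the following.

Lemma 7. [ABL13] For $X \sim \mathcal{P}$, for any $w \in \mathbb{R}^d$ with $\|w\| = 1$, for any $C > 0$ and $\tau, b \in [0, C/\sqrt{d}]$, for $c_2, c_3$ as in Lemma 6,

$$\mathbb{E}\left[ \ell_\tau(|w \cdot X|) \,\Big|\, |w \cdot X| \le b \right] \le \frac{c_3 \tau}{c_2 b}.$$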


The following is a slightly stronger version of a result of [ABL13] (specifically, the size of $m_k$, and consequently the bound on $|W_k|$, are both improved by a factor of $d$ compared to the original result).


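Lemma 8. Fix any $\delta \in (0, 1/e)$. For universal constants $c_4, c_5, c_6, c_7, c_8, c_9, c_{10} \in (0, \infty)$, for an appropriate choice of $\kappa \in (0, 1)$ (a universal constant), if $\alpha = c_9 \sqrt{\Delta d \log\left(\frac{1}{\kappa\delta}\right)}$, for every $k \in \{1, \ldots, \lceil \log_2(1/\alpha) \rceil\}$, if $b_{k-1} = c_7 2^{1-k}/\sqrt{d}$, $\tau_k = c_8 2^{-k}/\sqrt{d}$, $r_k = c_{10} 2^{-k}$, $\delta_k = \delta/(\lceil \log_2(4/\alpha) \rceil - k)^2$, and $m_k = \left\lceil c_5 \frac{2^k}{\kappa^2} d \log\left(\frac{1}{\kappa\delta_k}\right) \right\rceil$, and if $\mathcal{P}(x : h_{w_{k-1}}(x) \neq h_{w^*}(x)) \le 2^{-k-3}$, then with probability at least $1 - (4/3)\delta_k$, $|W_k| \le c_6 \frac{1}{\kappa^2} d \log\left(\frac{1}{\kappa\delta_k}\right)$ and $\mathcal{P}(x : h_{w_k}(x) \neq h_{w^*}(x)) \le 2^{-k-4}$.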
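To make the geometric schedule in Lemma 8 concrete, the following Python sketch tabulates $b_{k-1} = c_7 2^{1-k}/\sqrt{d}$, $\tau_k = c_8 2^{-k}/\sqrt{d}$, $r_k = c_{10} 2^{-k}$, and $\delta_k = \delta/(\lceil \log_2(4/\alpha) \rceil - k)^2$ for each round. The default constants are placeholders set to 1 purely for illustration (the lemma only asserts that suitable universal constants exist), and the function name `abl_schedule` is hypothetical.

```python
import math


def abl_schedule(alpha, delta, d, c7=1.0, c8=1.0, c10=1.0):
    """Per-round parameters for k = 1, ..., ceil(log2(1/alpha)).

    c7, c8, c10 are placeholder values, not the constants from the proof.
    """
    n_rounds = math.ceil(math.log2(1.0 / alpha))
    offset = math.ceil(math.log2(4.0 / alpha))
    rounds = []
    for k in range(1, n_rounds + 1):
        rounds.append({
            "k": k,
            "b_prev": c7 * 2.0 ** (1 - k) / math.sqrt(d),  # band width b_{k-1}
            "tau": c8 * 2.0 ** (-k) / math.sqrt(d),        # hinge scale tau_k
            "r": c10 * 2.0 ** (-k),                        # search radius r_k
            "delta_k": delta / (offset - k) ** 2,          # confidence budget
        })
    return rounds


schedule = abl_schedule(alpha=1 / 16, delta=0.1, d=4)
total_failure_budget = sum(row["delta_k"] for row in schedule)
```

The band width, hinge scale, and radius all halve from one round to the next, while the confidence budgets $\delta_k$ are summable across rounds.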


Proof. By Lemma 6 and a Chernoff and union bound, for an appropriately large choice of $c_5$ and any $c_7 > 0$, letting $c_2, c_3$ be as in Lemma 6 (with $C = c_7 \vee (c_8/2)$), with probability at least $1 - \delta_k/3$,

$$c_2 c_7 2^{-k} m_k \le |W_k| \le 4 c_3 c_7 2^{-k} m_k. \qquad (8)$$

The claimed upper bound on $|W_k|$ follows from this second inequality.

Next note that, if $\mathcal{P}(x : h_{w_{k-1}}(x) \neq h_{w^*}(x)) \le 2^{-k-3}$, then

$$\max\left\{ \ell_{\tau_k}(y(w^* \cdot x)) : x \in \mathbb{R}^d, |w_{k-1} \cdot x| \le b_{k-1}, y \in \{-1, +1\} \right\} \le c_{11}\sqrt{d}$$

for some universal constant $c_{11} > 0$. Furthermore, since $\mathcal{P}(x : h_{w_{k-1}}(x) \neq h_{w^*}(x)) \le 2^{-k-3}$, we know that the angle between $w_{k-1}$ and $w^*$ is at most $2^{-k-3}\pi$, so that

$$\|w_{k-1} - w^*\| = \sqrt{2 - 2 w_{k-1} \cdot w^*} \le \sqrt{2 - 2\cos(2^{-k-3}\pi)}$$

$$\le \sqrt{2 - 2\cos^2(2^{-k-3}\pi)} = \sqrt{2}\sin(2^{-k-3}\pi) \le 2^{-k-3}\pi\sqrt{2}.$$

For $c_{10} = \pi\sqrt{2}\, 2^{-3}$, this is $r_k$. By Hoeffding's inequality (under the conditional distribution given $|W_k|$), the law of total probability, Lemma 4, and linearity of conditional expectations, with probability at least $1 - \delta_k/3$, for $X \sim \mathcal{P}$,

$$\sum_{(x,y) \in W_k} \ell_{\tau_k}(y(w^* \cdot x)) \le |W_k| \mathbb{E}\left[ \ell_{\tau_k}(|w^* \cdot X|) \,\Big|\, w_{k-1}, |w_{k-1} \cdot X| \le b_{k-1} \right] + c_1 |W_k| \sqrt{2^k \Delta M_1} \frac{z_k}{\tau_k} + \sqrt{|W_k| (1/2) c_{11}^2 d \ln(3/\delta_k)}. \qquad (9)$$

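We bound each term on the right hand side separately. By Lemma 7, the first term is at most $|W_k| \frac{c_3 \tau_k}{c_2 b_{k-1}} = |W_k| \frac{c_3 c_8}{2 c_2 c_7}$. Next,

$$\frac{z_k}{\tau_k} = \frac{\sqrt{c_{10}^2 2^{-2k}/(d-1) + 4 c_7^2 2^{-2k}/d}}{c_8 2^{-k}/\sqrt{d}} \le \frac{\sqrt{2 c_{10}^2 + 4 c_7^2}}{c_8},$$

while $2^k \le 2/\alpha$ so that the second term is at most

$$\sqrt{2}\, c_1 \frac{\sqrt{2 c_{10}^2 + 4 c_7^2}}{c_8} |W_k| \sqrt{\frac{\Delta M_1}{\alpha}}.$$

Noting that

$$M_1 = \sum_{k'=1}^{\lceil \log_2(1/\alpha) \rceil} m_{k'} \le \frac{32 c_5}{\kappa^2} \frac{1}{\alpha} d \log\left(\frac{1}{\kappa\delta}\right),$$

we find that the second term on the right hand side of (9) is at most

$$\sqrt{c_5}\, \frac{8 c_1}{\kappa} \frac{\sqrt{2 c_{10}^2 + 4 c_7^2}}{c_8} |W_k| \frac{\sqrt{\Delta d \log\left(\frac{1}{\kappa\delta}\right)}}{\alpha} = \frac{8 c_1 \sqrt{c_5}}{\kappa} \frac{\sqrt{2 c_{10}^2 + 4 c_7^2}}{c_8 c_9} |W_k|. \qquad (10)$$

Finally, since $d \ln(3/\delta_k) \le 2 d \ln(1/\delta_k) \le \frac{2\kappa^2}{c_5} 2^{-k} m_k$, and (8) implies $2^{-k} m_k \le \frac{1}{c_2 c_7} |W_k|$, the third term on the right hand side of (9) is at most

$$|W_k| \frac{c_{11} \kappa}{\sqrt{c_2 c_5 c_7}}.$$

Altogether, we have

$$\sum_{(x,y) \in W_k} \ell_{\tau_k}(y(w^* \cdot x)) \le |W_k| \left( \frac{c_3 c_8}{2 c_2 c_7} + \frac{8 c_1 \sqrt{c_5}}{\kappa} \frac{\sqrt{2 c_{10}^2 + 4 c_7^2}}{c_8 c_9} + \frac{c_{11} \kappa}{\sqrt{c_2 c_5 c_7}} \right).$$

Taking $c_9 = 1/\kappa^3$ and $c_8 = \kappa$, this is at most

$$\kappa |W_k| \left( \frac{c_3}{2 c_2 c_7} + 8 c_1 \sqrt{c_5} \sqrt{2 c_{10}^2 + 4 c_7^2} + \frac{c_{11}}{\sqrt{c_2 c_5 c_7}} \right).$$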

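Next, note that because $h_{w_k}(x) \neq y \Rightarrow \ell_{\tau_k}(y(v_k \cdot x)) \ge 1$, and because (as proven above) $\|w^* - w_{k-1}\| \le r_k$,

$$|W_k| \mathrm{er}_{W_k}(h_{w_k}) \le \sum_{(x,y) \in W_k} \ell_{\tau_k}(y(v_k \cdot x)) \le \sum_{(x,y) \in W_k} \ell_{\tau_k}(y(w^* \cdot x)) + \kappa |W_k|.$$

Combined with the above, we have

$$|W_k| \mathrm{er}_{W_k}(h_{w_k}) \le \kappa |W_k| \left( 1 + \frac{c_3}{2 c_2 c_7} + 8 c_1 \sqrt{c_5} \sqrt{2 c_{10}^2 + 4 c_7^2} + \frac{c_{11}}{\sqrt{c_2 c_5 c_7}} \right).$$

Let $c_{12} = 1 + \frac{c_3}{2 c_2 c_7} + 8 c_1 \sqrt{c_5} \sqrt{2 c_{10}^2 + 4 c_7^2} + \frac{c_{11}}{\sqrt{c_2 c_5 c_7}}$. Furthermore,

$$|W_k| \mathrm{er}_{W_k}(h_{w_k}) = \sum_{(x,y) \in W_k} \mathbb{1}[h_{w_k}(x) \neq y] \ge \sum_{(x,y) \in W_k} \mathbb{1}[h_{w_k}(x) \neq h_{w^*}(x)] - \sum_{(x,y) \in W_k} \mathbb{1}[h_{w^*}(x) \neq y].$$

For an appropriately large value of $c_5$, by a Chernoff bound, with probability at least $1 - \delta_k/3$,

$$\sum_{s = t + \sum_{j=0}^{k-1} m_j + 1}^{t + \sum_{j=0}^{k} m_j} \mathbb{1}[h_{w^*}(X_s) \neq Y_s] \le 2 e \Delta M_1 m_k + \log_2(3/\delta_k).$$

In particular, this implies

$$\sum_{(x,y) \in W_k} \mathbb{1}[h_{w^*}(x) \neq y] \le 2 e \Delta M_1 m_k + \log_2(3/\delta_k),$$

so that

$$\sum_{(x,y) \in W_k} \mathbb{1}[h_{w_k}(x) \neq h_{w^*}(x)] \le |W_k| \mathrm{er}_{W_k}(h_{w_k}) + 2 e \Delta M_1 m_k + \log_2(3/\delta_k).$$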


Noting that m and dH) imply 


AMiruk < A 


32c 5 dlog(^) 


-\Wk\ < 


32c5 


C2C7c|k2 


f4zsdlc,g(i) 
a2‘|m| = 15£5^1a2‘|H2t| < 


C2C7C9/s:^ y 
32c5K^ 


Z\dlog 

I^Ffel, 


2^\Wk\ 


C 2 C 7 


C2C7 
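and (8) implies $\log_2(3/\delta_k) \le \frac{2\kappa^2}{c_2 c_5 c_7} |W_k|$, altogether we have

$$\sum_{(x,y) \in W_k} \mathbb{1}[h_{w_k}(x) \neq h_{w^*}(x)] \le |W_k| \mathrm{er}_{W_k}(h_{w_k}) + \frac{64 e c_5 \kappa^4}{c_2 c_7} |W_k| + \frac{2\kappa^2}{c_2 c_5 c_7} |W_k|$$

$$\le \kappa |W_k| \left( c_{12} + \frac{64 e c_5 \kappa^3}{c_2 c_7} + \frac{2\kappa}{c_2 c_5 c_7} \right).$$

Letting $c_{13} = c_{12} + \frac{64 e c_5}{c_2 c_7} + \frac{2}{c_2 c_5 c_7}$, and noting $\kappa < 1$, we have

$$\sum_{(x,y) \in W_k} \mathbb{1}[h_{w_k}(x) \neq h_{w^*}(x)] \le c_{13} \kappa |W_k|.$$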

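Lemma 1 (applied under the conditional distribution given $|W_k|$) and the law of total probability imply that with probability at least $1 - \delta_k/3$,

$$|W_k| \mathcal{P}\left(x : h_{w_k}(x) \neq h_{w^*}(x) \,\Big|\, |w_{k-1} \cdot x| \le b_{k-1}\right) \le \sum_{(x,y) \in W_k} \mathbb{1}[h_{w_k}(x) \neq h_{w^*}(x)] + c_{14} \sqrt{|W_k| \left(d \log(|W_k|/d) + \log(1/\delta_k)\right)},$$

for a universal constant $c_{14} > 0$. Combined with the above, and the fact that (8) implies $\log(1/\delta_k) \le \frac{\kappa^2}{c_2 c_5 c_7} |W_k|$ and

$$d \log(|W_k|/d) \le d \log\left( \frac{8 c_3 c_5 c_7 \log\left(\frac{1}{\kappa\delta_k}\right)}{\kappa^2} \right) \le d \log\left( \frac{8 c_3 c_5 c_7}{\kappa^3 \delta_k} \right)$$

$$\le 3 \log(8 \max\{c_3, 1\} c_5)\, c_5 d \log\left(\frac{1}{\kappa\delta_k}\right) \le \frac{3 \log(8 \max\{c_3, 1\} c_5)\, \kappa^2}{c_2 c_7} |W_k|,$$

we have

$$|W_k| \mathcal{P}\left(x : h_{w_k}(x) \neq h_{w^*}(x) \,\Big|\, |w_{k-1} \cdot x| \le b_{k-1}\right) \le c_{13} \kappa |W_k| + c_{14} \sqrt{|W_k| \left( \frac{3 \log(8 \max\{c_3, 1\} c_5)\, \kappa^2}{c_2 c_7} |W_k| + \frac{\kappa^2}{c_2 c_5 c_7} |W_k| \right)}$$

$$= \kappa |W_k| \left( c_{13} + c_{14} \sqrt{ \frac{3 \log(8 \max\{c_3, 1\} c_5)}{c_2 c_7} + \frac{1}{c_2 c_5 c_7} } \right).$$

Thus, letting $c_{15} = c_{13} + c_{14} \sqrt{ \frac{3 \log(8 \max\{c_3, 1\} c_5)}{c_2 c_7} + \frac{1}{c_2 c_5 c_7} }$, we have

$$\mathcal{P}\left(x : h_{w_k}(x) \neq h_{w^*}(x) \,\Big|\, |w_{k-1} \cdot x| \le b_{k-1}\right) \le c_{15} \kappa. \qquad (11)$$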


Next, note that $\|v_k - w_{k-1}\|^2 = \|v_k\|^2 + 1 - 2\|v_k\| \cos(\pi \mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)))$. Thus, one implication of the fact that $\|v_k - w_{k-1}\| \le r_k$ is that $\frac{\|v_k\|}{2} + \frac{1 - r_k^2}{2\|v_k\|} \le \cos(\pi \mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)))$; since the left hand side is positive, we have $\mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)) < 1/2$. Additionally, by differentiating, one can easily verify that for $\phi \in [0, \pi]$, $x \mapsto \sqrt{x^2 + 1 - 2x\cos(\phi)}$ is minimized at $x = \cos(\phi)$, in which case $\sqrt{x^2 + 1 - 2x\cos(\phi)} = \sin(\phi)$. Thus, $\|v_k - w_{k-1}\| \ge \sin(\pi \mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)))$. Since $\|v_k - w_{k-1}\| \le r_k$, we have $\sin(\pi \mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x))) \le r_k$. Since $\sin(\pi x) \ge x$ for all $x \in [0, 1/2]$, combining this with the fact (proven above) that $\mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)) < 1/2$ implies $\mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)) \le r_k$.

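In particular, we have that both $\mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)) \le r_k$ and $\mathcal{P}(x : h_{w^*}(x) \neq h_{w_{k-1}}(x)) \le 2^{-k-3} \le r_k$. Now Lemma 5 implies that, for any universal constant $c > 0$, there exists a corresponding universal constant $c' > 0$ such that

$$\mathcal{P}\left(x : h_{w_k}(x) \neq h_{w_{k-1}}(x) \text{ and } |w_{k-1} \cdot x| \ge c' \frac{r_k}{\sqrt{d}}\right) \le c r_k$$

and

$$\mathcal{P}\left(x : h_{w^*}(x) \neq h_{w_{k-1}}(x) \text{ and } |w_{k-1} \cdot x| \ge c' \frac{r_k}{\sqrt{d}}\right) \le c r_k,$$

so that (by a union bound)

$$\mathcal{P}\left(x : h_{w_k}(x) \neq h_{w^*}(x) \text{ and } |w_{k-1} \cdot x| \ge c' \frac{r_k}{\sqrt{d}}\right)$$

$$\le \mathcal{P}\left(x : h_{w_k}(x) \neq h_{w_{k-1}}(x) \text{ and } |w_{k-1} \cdot x| \ge c' \frac{r_k}{\sqrt{d}}\right) + \mathcal{P}\left(x : h_{w^*}(x) \neq h_{w_{k-1}}(x) \text{ and } |w_{k-1} \cdot x| \ge c' \frac{r_k}{\sqrt{d}}\right) \le 2 c r_k.$$

In particular, letting $c_7 = c' c_{10}/2$, we have $c' \frac{r_k}{\sqrt{d}} = b_{k-1}$. Combining this with (11), Lemma 6, and a union bound, we have that

$$\mathcal{P}(x : h_{w_k}(x) \neq h_{w^*}(x))$$

$$\le \mathcal{P}(x : h_{w_k}(x) \neq h_{w^*}(x) \text{ and } |w_{k-1} \cdot x| \ge b_{k-1}) + \mathcal{P}(x : h_{w_k}(x) \neq h_{w^*}(x) \text{ and } |w_{k-1} \cdot x| \le b_{k-1})$$

$$\le 2 c r_k + \mathcal{P}\left(x : h_{w_k}(x) \neq h_{w^*}(x) \,\Big|\, |w_{k-1} \cdot x| \le b_{k-1}\right) \mathcal{P}(x : |w_{k-1} \cdot x| \le b_{k-1})$$

$$\le 2 c r_k + c_{15} \kappa c_3 b_{k-1} \sqrt{d} = \left(2^5 c c_{10} + c_{15} \kappa c_3 c_7 2^5\right) 2^{-k-4}.$$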


Taking $c = \frac{1}{2^6 c_{10}}$ and $\kappa = \frac{1}{2^6 c_3 c_7 c_{15}}$, we have $\mathcal{P}(x : h_{w_k}(x) \neq h_{w^*}(x)) \le 2^{-k-4}$, as required.

By a union bound, this occurs with probability at least $1 - (4/3)\delta_k$. □
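The proof above repeatedly converts between disagreement probability and Euclidean distance, using the fact that under the uniform distribution on the sphere two unit vectors at angle $\theta$ define halfspaces that disagree with probability $\theta/\pi$. The following Python sketch numerically checks the resulting bound $\|w_{k-1} - w^*\| \le c_{10} 2^{-k}$ with $c_{10} = \pi\sqrt{2}\, 2^{-3}$ when the disagreement is exactly $2^{-k-3}$. The function name and the range of $k$ are illustrative choices.

```python
import math


def chord_from_disagreement(p):
    """Distance between two unit vectors whose halfspaces disagree with
    probability p under the uniform distribution (angle = pi * p)."""
    theta = math.pi * p
    return math.sqrt(2.0 - 2.0 * math.cos(theta))


C10 = math.pi * math.sqrt(2.0) / 8.0  # c_10 = pi * sqrt(2) * 2^{-3}

# For each round k, compare the chord length at disagreement 2^{-k-3} to r_k.
checks = []
for k in range(1, 11):
    chord = chord_from_disagreement(2.0 ** (-k - 3))
    r_k = C10 * 2.0 ** (-k)
    checks.append(chord <= r_k)
```

The bound is loose by roughly a factor of $\sqrt{2}$, since the chord length $2\sin(\theta/2)$ is already at most $\theta$.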


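Proof (Proof of Theorem 3). We begin with the bound on the error rate. If $\Delta > \frac{\pi^2}{400 \cdot 2^{27} (d + \ln(4/\delta))}$, the result trivially holds, since then $1 \le \frac{400 \cdot 2^{27}}{\pi^2} \sqrt{\Delta (d + \ln(4/\delta))}$. Otherwise, suppose $\Delta \le \frac{\pi^2}{400 \cdot 2^{27} (d + \ln(4/\delta))}$.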














Fix any $i \in \mathbb{N}$. Lemma 3 implies that, with probability at least $1 - \delta/4$, the $w_0$ returned in Step 0 of ABL$(M(i-1), \tilde{h}_{i-1})$ satisfies $\mathcal{P}(x : h_{w_0}(x) \neq h^*_{M(i-1)+m_0+1}(x)) \le 1/16$. Taking this as a base case, Lemma 8 then inductively implies that, with probability at least



flogs (l/a)l C C 

E , , 0 0 


l + (4/3)^ 


e =2 



> l-(5. 


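every $k \in \{0, 1, \ldots, \lceil \log_2(1/\alpha) \rceil\}$ has

$$\mathcal{P}(x : h_{w_k}(x) \neq h^*_{M(i-1)+m_0+1}(x)) \le 2^{-k-4}, \qquad (12)$$

and furthermore the number of labels requested during ABL$(M(i-1), \tilde{h}_{i-1})$ totals at most (for appropriate universal constants $\hat{c}_1, \hat{c}_2$)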

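$$m_0 + \sum_{k=1}^{\lceil \log_2(1/\alpha) \rceil} |W_k| \le \hat{c}_1 \left( d + \ln\left(\frac{1}{\delta}\right) + \sum_{k=1}^{\lceil \log_2(1/\alpha) \rceil} d \log\left( \frac{(\lceil \log_2(4/\alpha) \rceil - k)^2}{\delta} \right) \right) \le \hat{c}_2 d \log\left(\frac{1}{\Delta d}\right) \log\left(\frac{1}{\delta}\right).$$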

In particular, by a union bound, (12) implies that for every $k \in \{1, \ldots, \lceil \log_2(1/\alpha) \rceil\}$, every

$$m \in \left\{ M(i-1) + \sum_{j=0}^{k-1} m_j + 1, \ldots, M(i-1) + \sum_{j=0}^{k} m_j \right\}$$

has

$$\mathcal{P}(x : h_{w_{k-1}}(x) \neq h^*_m(x)) \le \mathcal{P}(x : h_{w_{k-1}}(x) \neq h^*_{M(i-1)+m_0+1}(x)) + \mathcal{P}(x : h^*_{M(i-1)+m_0+1}(x) \neq h^*_m(x)) \le 2^{-k-3} + \Delta M.$$


Thus, noting that 



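with probability at least $1 - \delta$,

$$\mathcal{P}(x : h_{w_{\lceil \log_2(1/\alpha) \rceil - 1}}(x) \neq h^*_{Mi}(x)) \le O(\alpha + \Delta M) = O\left( \sqrt{\Delta d \log\left(\frac{1}{\delta}\right)} \right).$$

In particular, this implies that, with probability at least $1 - \delta$, every $t \in \{Mi + 1, \ldots, M(i+1) - 1\}$ has

$$\mathrm{er}_t(\hat{h}_t) \le \mathcal{P}(x : h_{w_{\lceil \log_2(1/\alpha) \rceil - 1}}(x) \neq h^*_{Mi}(x)) + \mathcal{P}(x : h^*_{Mi}(x) \neq h^*_t(x))$$

$$\le O\left( \sqrt{\Delta d \log\left(\frac{1}{\delta}\right)} \right) + \Delta M = O\left( \sqrt{\Delta d \log\left(\frac{1}{\delta}\right)} \right),$$

which completes the proof of the bound on the error rate.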

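Setting $\delta = \sqrt{\Delta d}$, and noting that $\mathbb{1}[\hat{Y}_t \neq Y_t] \le 1$, we have that for any $t > M$,

$$\mathbb{P}(\hat{Y}_t \neq Y_t) = \mathbb{E}\left[\mathrm{er}_t(\hat{h}_t)\right] \le O\left( \sqrt{\Delta d \log\left(\frac{1}{\delta}\right)} \right) + \delta = O\left( \sqrt{\Delta d \log\left(\frac{1}{\Delta d}\right)} \right).$$

Thus, by linearity of the expectation,

$$\mathbb{E}\left[ \sum_{t=1}^{T} \mathbb{1}[\hat{Y}_t \neq Y_t] \right] \le M + O\left( \sqrt{\Delta d \log\left(\frac{1}{\Delta d}\right)}\, T \right).$$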


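Furthermore, as mentioned, with probability at least $1 - \delta$, the number of labels requested during the execution of ABL$(M(i-1), \tilde{h}_{i-1})$ is at most

$$O\left( d \log\left(\frac{1}{\Delta d}\right) \log\left(\frac{1}{\delta}\right) \right).$$

Thus, since the number of labels requested during the execution of ABL$(M(i-1), \tilde{h}_{i-1})$ cannot exceed $M$, letting $\delta = \sqrt{\Delta d}$, the expected number of requested labels during this execution is at most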
labels during this execution is at most 




+ y/MM < O 



= o 





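Thus, by linearity of the expectation, the expected number of labels requested among the first $T$ samples is at most

$$O\left( d \log^2\left(\frac{1}{\Delta d}\right) \left\lceil \frac{T}{M} \right\rceil \right) = O\left( \sqrt{\Delta d} \log^{3/2}\left(\frac{1}{\Delta d}\right) T \right),$$

which completes the proof. □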


Remark: The original work of [CMEDV10] additionally allowed for some number $K$ of "jumps": times $t$ at which $\Delta_t = 1$. Note that, in the above algorithm, since the influence of each sample is localized to the predictors trained within that "batch" of $M$ instances, the effect of allowing such jumps would only change the bound on the number of mistakes to $\tilde{O}\left( \sqrt{d\Delta}\, T + \sqrt{d/\Delta}\, K \right)$. This compares favorably to the result of [CMEDV10], which is roughly $O\left( (d\Delta)^{1/4} T + \frac{d^{1/4}}{\Delta^{3/4}} K \right)$. However, the result of [CMEDV10] was proven for a more general setting, allowing distributions $\mathcal{P}$ that are not uniform (though they do require a relation between the angle between any two separators and the probability mass they disagree on, similar to that holding for the uniform distribution, which seems to require that the distributions approximately retain some properties of the uniform distribution). It is not clear whether Theorem 3 can be generalized to this larger family of distributions.

6 General Results for Active Learning 

As mentioned, the above results on linear separators also provide results for the number of queries in active learning. One can also state quite general results on the expected number of queries and mistakes achievable by an active learning algorithm. This section provides such results, for an algorithm based on the well-known strategy of disagreement-based active learning. Throughout this section, we suppose $\mathbf{h}^* \in S_\Delta$, for a given $\Delta \in (0, 1]$: that is, $\mathcal{P}(x : h^*_{t+1}(x) \neq h^*_t(x)) \le \Delta$ for all $t \in \mathbb{N}$.

First, we introduce a few definitions. For any set $\mathcal{H} \subseteq \mathbb{C}$, define the region of disagreement

$$\mathrm{DIS}(\mathcal{H}) = \{x \in \mathcal{X} : \exists h, g \in \mathcal{H} \text{ s.t. } h(x) \neq g(x)\}.$$
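For a finite set of classifiers and a finite pool of points, the region of disagreement can be computed directly from this definition. The Python sketch below does so for one-dimensional threshold classifiers; the threshold class and the helper names `make_threshold` and `disagreement_region` are illustrative assumptions, not part of the analysis.

```python
def make_threshold(t):
    """Threshold classifier on the real line: +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1


def disagreement_region(hypotheses, points):
    """Points on which at least two hypotheses in the set predict differently."""
    return [x for x in points if len({h(x) for h in hypotheses}) > 1]


H = [make_threshold(0.3), make_threshold(0.5)]
points = [0.1, 0.3, 0.4, 0.5, 0.9]
region = disagreement_region(H, points)
```

Here the two thresholds disagree exactly on points in $[0.3, 0.5)$, so only those pool points land in `region`.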

The analysis in this section is centered around the following algorithm. The Active subroutine is from the work of [Han12] (slightly modified here), and is a variant of the $A^2$ (Agnostic Active) algorithm of [BBL06]; the appropriate values of $M$ and $\hat{T}_k(\cdot)$ will be discussed below.

Algorithm: DriftingActive
0. For $i = 1, 2, \ldots$

1. Active$(M(i-1))$


Subroutine: Active$(t)$

0. Let $\hat{h}_0$ be an arbitrary element of $\mathbb{C}$, and let $V_0 \leftarrow \mathbb{C}$

1. Predict $\hat{Y}_{t+1} = \hat{h}_0(X_{t+1})$ as the prediction for the value of $Y_{t+1}$

2. For $k = 0, 1, \ldots, \log_2(M/2)$

3. $Q_k \leftarrow \{\}$

4. For $s = 2^k + 1, \ldots, 2^{k+1}$

5. Predict $\hat{Y}_s = \hat{h}_k(X_s)$ as the prediction for the value of $Y_s$

6. If $X_s \in \mathrm{DIS}(V_k)$

7. Request the label $Y_s$ and let $Q_k \leftarrow Q_k \cup \{(X_s, Y_s)\}$

8. Let $\hat{h}_{k+1} = \mathrm{argmin}_{h \in V_k} \sum_{(x,y) \in Q_k} \mathbb{1}[h(x) \neq y]$

9. Let $V_{k+1} \leftarrow \{h \in V_k : \sum_{(x,y) \in Q_k} \mathbb{1}[h(x) \neq y] - \mathbb{1}[\hat{h}_{k+1}(x) \neq y] \le \hat{T}_k\}$

As in the DriftingHalfspaces algorithm above, this DriftingActive algorithm proceeds in batches, and in each batch runs an active learning algorithm designed to be robust to classification noise. This robustness to classification noise translates into our setting as tolerance for the fact that there is no classifier in $\mathbb{C}$ that perfectly classifies all of the data. The specific algorithm employed here maintains a set $V_k \subseteq \mathbb{C}$ of candidate classifiers, and requests the labels of samples $X_s$ for which there is some disagreement on the classification among classifiers in $V_k$. We maintain the invariant that there is a low-error classifier contained in $V_k$ at all times, and thus the points we query provide some information to help us determine which among these remaining candidates has low error rate. Based on these queries, we periodically (in Step 9) remove from $V_k$ those classifiers making a relatively excessive number of mistakes on the queried samples, relative to the minimum among classifiers in $V_k$. All predictions are made with an element of $V_k$.⁵
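The steps of the Active subroutine can be sketched in Python for a finite class of classifiers. This is a minimal illustration under several assumptions that are not in the paper: the concept space is replaced by a finite list, the stream is a pre-drawn list of $M$ labeled pairs (labels are read only when "requested"), $M$ is a power of two, and the pruning threshold is passed in as a function. The names `active_batch` and `make_threshold` are hypothetical.

```python
def make_threshold(t):
    """Threshold classifier on the real line: +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1


def active_batch(stream, hypotheses, M, threshold):
    """One batch of a disagreement-based learner over stream[0..M-1].

    threshold: function k -> pruning slack used in Step 9.
    Returns (predictions, number_of_label_requests, surviving_hypotheses).
    """
    V = list(hypotheses)                     # Step 0: start from the whole class
    h_hat = V[0]                             # Step 0: arbitrary initial predictor
    preds = [h_hat(stream[0][0])]            # Step 1: predict the first label
    requests = 0
    k_max = (M // 2).bit_length() - 1        # log2(M/2) when M is a power of two
    for k in range(k_max + 1):               # Step 2
        Q = []                               # Step 3
        for s in range(2 ** k + 1, 2 ** (k + 1) + 1):    # Step 4
            x, y = stream[s - 1]
            preds.append(h_hat(x))           # Step 5
            if len({h(x) for h in V}) > 1:   # Step 6: x lies in DIS(V_k)
                Q.append((x, y))             # Step 7: request the label
                requests += 1
        errs = [sum(1 for x, y in Q if h(x) != y) for h in V]
        best = min(errs)
        h_hat = V[errs.index(best)]          # Step 8: empirical minimizer on Q_k
        V = [h for h, e in zip(V, errs) if e - best <= threshold(k)]  # Step 9
    return preds, requests, V


# Illustrative run: nine thresholds, a fixed (non-drifting) target, no slack.
hyps = [make_threshold(j / 8) for j in range(9)]
target = hyps[4]
xs = [((7 * i) % 16 + 0.5) / 16 for i in range(16)]
stream = [(x, target(x)) for x in xs]
preds, requests, survivors = active_batch(stream, hyps, 16, lambda k: 0)
```

In this noise-free run the target is never pruned, and labels are requested only for points that fall inside the shrinking region of disagreement. With drift, a positive slack such as $\hat{T}_k$ is what keeps a good classifier in the version space.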

We prove an abstract bound on the number of labels requested by this algorithm, expressed in terms of the disagreement coefficient [Han07], defined as follows. For any $r \ge 0$ and any classifier $h$, define $\mathrm{B}(h, r) = \{g \in \mathbb{C} : \mathcal{P}(x : g(x) \neq h(x)) \le r\}$. Then for $r_0 \ge 0$ and any classifier $h$, define the disagreement coefficient of $h$ with respect to $\mathbb{C}$ under $\mathcal{P}$:

$$\theta_h(r_0) = \sup_{r > r_0} \frac{\mathcal{P}(\mathrm{DIS}(\mathrm{B}(h, r)))}{r}.$$
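As a worked example of this definition, consider threshold classifiers $h_t(x) = \mathbb{1}[x \ge t]$ with $t \in [0, 1]$ under the uniform distribution on $[0, 1]$. There $\mathcal{P}(x : h_s(x) \neq h_t(x)) = |s - t|$, so $\mathrm{DIS}(\mathrm{B}(h_t, r))$ is an interval of length at most $2r$. The Python sketch below approximates the supremum on a grid of radii; the class, the distribution, and the function names are assumptions chosen for illustration.

```python
def dis_mass_thresholds(t, r):
    """P(DIS(B(h_t, r))) for thresholds on [0, 1] under the uniform distribution.

    The ball contains thresholds s in [0, 1] with |s - t| <= r; they disagree
    exactly on the interval between the smallest and the largest such s.
    """
    return min(1.0, t + r) - max(0.0, t - r)


def disagreement_coefficient(t, r0, grid=10000):
    """Approximate sup over r > r0 of P(DIS(B(h_t, r))) / r on a grid in (r0, 1]."""
    best = 0.0
    for i in range(1, grid + 1):
        r = r0 + (1.0 - r0) * i / grid
        best = max(best, dis_mass_thresholds(t, r) / r)
    return best


theta_middle = disagreement_coefficient(0.5, 0.01)
theta_edge = disagreement_coefficient(0.0, 0.01)
```

For an interior threshold the ratio is 2 for every small radius, while at the boundary the ball extends in one direction only and the ratio drops to 1; in both cases the coefficient stays bounded as $r_0 \to 0$.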


Usually, the disagreement coefficient would be used with $h$ equal to the target concept; however, since the target concept is not fixed in our setting, we will make use of the worst-case value of the disagreement coefficient: $\theta_{\mathbb{C}}(r_0) = \sup_{h \in \mathbb{C}} \theta_h(r_0)$. This quantity has been bounded for a variety of spaces $\mathbb{C}$ and distributions $\mathcal{P}$ (see e.g., [Han07, EYW12, BL13]). It is useful in bounding how quickly the region $\mathrm{DIS}(V_k)$ collapses in the algorithm. Thus, since the probability the algorithm requests the label of the next instance is $\mathcal{P}(\mathrm{DIS}(V_k))$, the quantity $\theta_{\mathbb{C}}(r_0)$ naturally arises in characterizing the number of labels we expect this algorithm to request. Specifically, we have the following result.


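Theorem 4. For an appropriate universal constant $c_1 \in [1, \infty)$, if $\mathbf{h}^* \in S_\Delta$ for some $\Delta \in (0, 1]$, then taking $M = \left\lceil c_1 \sqrt{d/\Delta} \right\rceil_2$,⁶ and $\hat{T}_k = \log_2(1/\sqrt{d\Delta}) + 2^{2k+2} e \Delta$, and defining $\epsilon_\Delta = \sqrt{d\Delta}\, \mathrm{Log}(1/(d\Delta))$, the above DriftingActive algorithm makes an expected number of mistakes among the first $T$ instances that is

$$O\left( \epsilon_\Delta \mathrm{Log}(d/\Delta)\, T \right) = \tilde{O}\left( \sqrt{d\Delta} \right) T$$

and requests an expected number of labels among the first $T$ instances that is

$$O\left( \theta_{\mathbb{C}}(\epsilon_\Delta)\, \epsilon_\Delta \mathrm{Log}(d/\Delta)\, T \right) = \tilde{O}\left( \theta_{\mathbb{C}}(\sqrt{d\Delta}) \sqrt{d\Delta} \right) T.$$

⁵ One could alternatively proceed as in DriftingHalfspaces, using the final classifier from the previous batch, which would also add a guarantee on the error rate achieved at all sufficiently large $t$.

⁶ Here, we define $\lceil x \rceil_2 = 2^{\lceil \log_2(x) \rceil}$, for $x \ge 1$.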
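The quantities in Theorem 4 are simple to compute. The Python sketch below evaluates the batch length and the pruning threshold $\hat{T}_k = \log_2(1/\sqrt{d\Delta}) + 2^{2k+2} e \Delta$. It assumes that $\lceil x \rceil_2$ denotes rounding up to the nearest power of two and that $M = \lceil c_1 \sqrt{d/\Delta} \rceil_2$ (consistent with the use of $M/2 \le c_1\sqrt{d/\Delta}$ in the proof below). The value $c_1 = 1$ is a placeholder, and the function names are hypothetical.

```python
import math


def ceil_pow2(x):
    """Round x >= 1 up to a power of two: 2^ceil(log2 x)."""
    return 2 ** math.ceil(math.log2(x))


def batch_length(d, drift, c1=1.0):
    """Assumed batch length M = ceil_pow2(c1 * sqrt(d / drift)); c1 is a placeholder."""
    return ceil_pow2(c1 * math.sqrt(d / drift))


def pruning_threshold(k, d, drift):
    """T_k = log2(1 / sqrt(d * drift)) + 2^(2k+2) * e * drift."""
    return math.log2(1.0 / math.sqrt(d * drift)) + 2 ** (2 * k + 2) * math.e * drift


M = batch_length(d=4, drift=0.01)
thresholds = [pruning_threshold(k, 4, 0.01) for k in range(int(math.log2(M // 2)) + 1)]
```

The threshold grows with $k$ because later phases cover more rounds, over which the target has had more opportunity to drift away from its value at the start of the batch.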
The proof of Theorem 4 relies on an analysis of the behavior of the Active subroutine, characterized in the following lemma.


Lemma 9. Fix any $t \in \mathbb{N}$, and consider the values obtained in the execution of Active$(t)$. Under the conditions of Theorem 4, there is a universal constant $c_2 \in [1, \infty)$ such that, for any $k \in \{0, 1, \ldots, \log_2(M/2)\}$, with probability at least $1 - 2\sqrt{d\Delta}$, if $h^*_{t+1} \in V_k$, then $h^*_{t+1} \in V_{k+1}$ and $\sup_{h \in V_{k+1}} \mathcal{P}(x : h(x) \neq h^*_{t+1}(x)) \le c_2 2^{-k} d\, \mathrm{Log}(c_1/\sqrt{d\Delta})$.


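Proof. By a Chernoff bound, with probability at least $1 - \sqrt{d\Delta}$,

$$\sum_{s=2^k+1}^{2^{k+1}} \mathbb{1}[h^*_{t+1}(X_s) \neq Y_s] \le \log_2(1/\sqrt{d\Delta}) + 2^{2k+2} e \Delta = \hat{T}_k.$$

Therefore, if $h^*_{t+1} \in V_k$, then since every $g \in V_k$ agrees with $h^*_{t+1}$ on those points $X_s \notin \mathrm{DIS}(V_k)$, in the update in Step 9 defining $V_{k+1}$, we have

$$\sum_{(x,y) \in Q_k} \mathbb{1}[h^*_{t+1}(x) \neq y] - \mathbb{1}[\hat{h}_{k+1}(x) \neq y] = \sum_{s=2^k+1}^{2^{k+1}} \mathbb{1}[h^*_{t+1}(X_s) \neq Y_s] - \min_{g \in V_k} \sum_{s=2^k+1}^{2^{k+1}} \mathbb{1}[g(X_s) \neq Y_s]$$

$$\le \sum_{s=2^k+1}^{2^{k+1}} \mathbb{1}[h^*_{t+1}(X_s) \neq Y_s] \le \hat{T}_k,$$

so that $h^*_{t+1} \in V_{k+1}$ as well.

Furthermore, if $h^*_{t+1} \in V_k$, then by the definition of $V_{k+1}$, we know every $h \in V_{k+1}$ has

$$\sum_{s=2^k+1}^{2^{k+1}} \mathbb{1}[h(X_s) \neq Y_s] \le \hat{T}_k + \sum_{s=2^k+1}^{2^{k+1}} \mathbb{1}[h^*_{t+1}(X_s) \neq Y_s],$$

so that a triangle inequality implies

$$\sum_{s=2^k+1}^{2^{k+1}} \mathbb{1}[h(X_s) \neq h^*_{t+1}(X_s)] \le \sum_{s=2^k+1}^{2^{k+1}} \mathbb{1}[h(X_s) \neq Y_s] + \mathbb{1}[h^*_{t+1}(X_s) \neq Y_s]$$

$$\le \hat{T}_k + 2 \sum_{s=2^k+1}^{2^{k+1}} \mathbb{1}[h^*_{t+1}(X_s) \neq Y_s] \le 3\hat{T}_k.$$

Lemma 1 then implies that, on an additional event of probability at least $1 - \sqrt{d\Delta}$, every $h \in V_{k+1}$ has

$$\mathcal{P}(x : h(x) \neq h^*_{t+1}(x))$$

$$\le 2^{-k} 3\hat{T}_k + c 2^{-k} \sqrt{3\hat{T}_k \left( d\, \mathrm{Log}(2^k/d) + \mathrm{Log}(1/\sqrt{d\Delta}) \right)} + c 2^{-k} \left( d\, \mathrm{Log}(2^k/d) + \mathrm{Log}(1/\sqrt{d\Delta}) \right)$$

$$\le 2^{-k} 3 \log_2(1/\sqrt{d\Delta}) + 2^k 12 e \Delta + c 2^{-k} \sqrt{6 \log_2(1/\sqrt{d\Delta})\, d\, \mathrm{Log}(c_1/\sqrt{d\Delta})} + c 2^{-k} \sqrt{2^{2k} 24 e \Delta d\, \mathrm{Log}(c_1/\sqrt{d\Delta})} + 2 c 2^{-k} d\, \mathrm{Log}(c_1/\sqrt{d\Delta})$$

$$\le 2^{-k} 3 \log_2(1/\sqrt{d\Delta}) + 12 e c_1 \sqrt{d\Delta} + 3 c 2^{-k} \sqrt{d}\, \mathrm{Log}(c_1/\sqrt{d\Delta}) + \sqrt{24 e}\, c \sqrt{d\Delta\, \mathrm{Log}(c_1/\sqrt{d\Delta})} + 2 c 2^{-k} d\, \mathrm{Log}(c_1/\sqrt{d\Delta}),$$

where $c$ is as in Lemma 1. Since $\sqrt{d\Delta} \le 2 c_1 d/M \le c_1 d 2^{-k}$, this is at most

$$\left( 5 + 12 e c_1^2 + 3c + \sqrt{24 e}\, c c_1 + 2c \right) 2^{-k} d\, \mathrm{Log}(c_1/\sqrt{d\Delta}).$$

Letting $c_2 = 5 + 12 e c_1^2 + 3c + \sqrt{24 e}\, c c_1 + 2c$, we have the result by a union bound.

□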


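We are now ready for the proof of Theorem 4.


Proof (Proof of Theorem 4). Fix any $i \in \mathbb{N}$, and consider running Active$(M(i-1))$. Since $h^*_{M(i-1)+1} \in \mathbb{C}$, by Lemma 9, a union bound, and induction, with probability at least $1 - 2\sqrt{d\Delta} \log_2(M/2) \ge 1 - 2\sqrt{d\Delta} \log_2(c_1\sqrt{d/\Delta})$, every $k \in \{0, 1, \ldots, \log_2(M/2)\}$ has

$$\sup_{h \in V_k} \mathcal{P}(x : h(x) \neq h^*_{M(i-1)+1}(x)) \le c_2 2^{1-k} d\, \mathrm{Log}(c_1/\sqrt{d\Delta}). \qquad (13)$$

Thus, since $\hat{h}_k \in V_k$ for each $k$, the expected number of mistakes among the predictions $\hat{Y}_{M(i-1)+1}, \ldots, \hat{Y}_{Mi}$ is

$$1 + \sum_{k=0}^{\log_2(M/2)} \sum_{s=2^k+1}^{2^{k+1}} \mathbb{P}(\hat{h}_k(X_{M(i-1)+s}) \neq Y_{M(i-1)+s})$$

$$\le 1 + \sum_{k=0}^{\log_2(M/2)} \sum_{s=2^k+1}^{2^{k+1}} \mathbb{P}(h^*_{M(i-1)+1}(X_{M(i-1)+s}) \neq Y_{M(i-1)+s}) + \sum_{k=0}^{\log_2(M/2)} \sum_{s=2^k+1}^{2^{k+1}} \mathbb{P}(\hat{h}_k(X_{M(i-1)+s}) \neq h^*_{M(i-1)+1}(X_{M(i-1)+s}))$$

$$\le 1 + \Delta M^2 + \sum_{k=0}^{\log_2(M/2)} 2^k \left( c_2 2^{1-k} d\, \mathrm{Log}(c_1/\sqrt{d\Delta}) + 2\sqrt{d\Delta} \log_2(M/2) \right)$$

$$\le 1 + 4 c_1^2 d + 2 c_2 d\, \mathrm{Log}(c_1/\sqrt{d\Delta}) \log_2(2 c_1 \sqrt{d/\Delta}) + 4 c_1 d \log_2(c_1 \sqrt{d/\Delta})$$

$$= O\left( d\, \mathrm{Log}(d/\Delta)\, \mathrm{Log}(1/(d\Delta)) \right).$$

Furthermore, (13) implies the algorithm only requests the label $Y_{M(i-1)+s}$ for $s \in \{2^k + 1, \ldots, 2^{k+1}\}$ if $X_{M(i-1)+s} \in \mathrm{DIS}(\mathrm{B}(h^*_{M(i-1)+1}, c_2 2^{1-k} d\, \mathrm{Log}(c_1/\sqrt{d\Delta})))$, so that the expected number of labels requested among $Y_{M(i-1)+1}, \ldots, Y_{Mi}$ is at most

$$1 + \sum_{k=0}^{\log_2(M/2)} 2^k \left( \mathbb{E}\left[ \mathcal{P}(\mathrm{DIS}(\mathrm{B}(h^*_{M(i-1)+1}, c_2 2^{1-k} d\, \mathrm{Log}(c_1/\sqrt{d\Delta})))) \right] + 2\sqrt{d\Delta} \log_2(c_1 \sqrt{d/\Delta}) \right)$$

$$\le 1 + \theta_{\mathbb{C}}\left( 4 c_2 d\, \mathrm{Log}(c_1/\sqrt{d\Delta})/M \right) 2 c_2 d\, \mathrm{Log}(c_1/\sqrt{d\Delta}) \log_2(2 c_1 \sqrt{d/\Delta}) + 4 c_1 d \log_2(c_1 \sqrt{d/\Delta})$$

$$= O\left( \theta_{\mathbb{C}}\left( \sqrt{d\Delta}\, \mathrm{Log}(1/(d\Delta)) \right) d\, \mathrm{Log}(d/\Delta)\, \mathrm{Log}(1/(d\Delta)) \right).$$

Thus, the expected number of mistakes among indices $1, \ldots, T$ is at most

$$O\left( d\, \mathrm{Log}(d/\Delta)\, \mathrm{Log}(1/(d\Delta)) \left\lceil \frac{T}{M} \right\rceil \right) = O\left( \sqrt{d\Delta}\, \mathrm{Log}(d/\Delta)\, \mathrm{Log}(1/(d\Delta))\, T \right),$$

and the expected number of labels requested among indices $1, \ldots, T$ is at most

$$O\left( \theta_{\mathbb{C}}\left( \sqrt{d\Delta}\, \mathrm{Log}(1/(d\Delta)) \right) d\, \mathrm{Log}(d/\Delta)\, \mathrm{Log}(1/(d\Delta)) \left\lceil \frac{T}{M} \right\rceil \right)$$

$$= O\left( \theta_{\mathbb{C}}\left( \sqrt{d\Delta}\, \mathrm{Log}(1/(d\Delta)) \right) \sqrt{d\Delta}\, \mathrm{Log}(d/\Delta)\, \mathrm{Log}(1/(d\Delta))\, T \right).$$

□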